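arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.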

2606.05165 2026-06-04 cs.LG cs.CL (updated version)

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye, Amirali Abdullah, Bernhard Schölkopf, Zhijing Jin

Institutions: Jinesis AI Lab, University of Toronto & Vector Institute; Max Planck Institute for Intelligent Systems, Tübingen, Germany; Thoughtworks; Martian; ELLIS Institute, Tübingen, Germany; EuroSafeAI

AI summary: Proposes the STRIDE framework, which casts training data attribution as a sparse recovery problem in the compressive-sensing sense and uses lightweight "steering operators" in activation space to mimic the influence of data subsets, giving efficient and accurate attribution for LLM pre-training.

Comments project page: https://stride-tda.github.io/

Abstract

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude ($13\times$) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.
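The sparse-recovery idea in this abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes that the measured effect of a data subset is simply the sum of per-example influences and that only a few examples matter, and it uses plain orthogonal matching pursuit in NumPy in place of learned steering operators. The names `omp`, `masks`, and `effects` and all sizes are hypothetical.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: recover a k-sparse x with y ≈ A @ x."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the column most correlated with what is still unexplained.
        scores = np.abs(A.T @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Refit all selected columns jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(0)
n_train, n_subsets, k = 60, 100, 3

# Hypothetical ground truth: only three training examples influence the test prediction.
true_influence = np.zeros(n_train)
true_influence[[5, 17, 42]] = [3.0, -2.0, 1.5]

# Each row marks which training examples belong to one random subset.
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)
# Assumed linear measurement model: subset effect = sum of member influences.
effects = masks @ true_influence

# Centering removes the shared offset so that columns are nearly uncorrelated.
A = masks - masks.mean(axis=0)
y = effects - effects.mean()
recovered = omp(A, y, k)
```

With far fewer subset measurements than would be needed to probe every example separately in a noisy setting, the greedy solver is expected to identify the few influential examples; a real system would face noise and nonlinearity that this toy model ignores.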


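2606.05161 2026-06-04 cs.SD cs.CL (updated version)

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo, Xi Wu, Xiaocui Yang, Shi Feng, Yifei Zhang, Daling Wang

Institutions: Northeastern University, China; Shanghai Artificial Intelligence Laboratory, China

AI summary: Using same-audio counterfactual experiments, this paper finds that audio-language models in conflict tasks often ignore audio evidence because the text dominates, and proposes GACL, a training-free decoding rule that repairs arbitration reversals by interpolating between joint and same-audio scores, substantially improving faithfulness.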

Abstract

Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
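The decoding rule described above interpolates between two sets of candidate scores. A minimal sketch of such an interpolation follows; the gate used here (correct only when the two branches prefer different answers) and the weight `lam` are illustrative assumptions, not the gating rule or hyperparameters from the paper.

```python
import numpy as np

def gated_interpolation(joint, same_audio, lam=0.6):
    """Blend joint and same-audio candidate scores when their top choices disagree."""
    joint = np.asarray(joint, dtype=float)
    same_audio = np.asarray(same_audio, dtype=float)
    # Hypothetical gate: 1.0 if the branches prefer different answers, else 0.0.
    gate = float(np.argmax(joint) != np.argmax(same_audio))
    # lam = 0 keeps the joint scores; lam = 1 fully adopts the same-audio scores.
    return joint + gate * lam * (same_audio - joint)

# Candidate 0 is text-supported, candidate 1 is audio-supported (made-up scores).
corrected = gated_interpolation([2.0, 1.0], [0.0, 3.0])
unchanged = gated_interpolation([2.0, 1.0], [3.0, 0.0])
```

In the first call the branches disagree, so the scores move toward the same-audio branch and the audio-supported candidate wins; in the second call they agree and the joint scores pass through untouched.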


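2606.05158 2026-06-04 cs.CL cs.AI cs.MA (updated version)

Streaming Communication in Multi-Agent Reasoning

Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

Institutions: HKUST(GZ); Alibaba Group; ZJU; HKUST

AI summary: Proposes StreamMA, a streaming multi-agent reasoning system that lowers latency by streaming reasoning steps to downstream agents in real time and, unexpectedly, also improves effectiveness; it also gives the first closed-form joint analysis of the stream, serial, and single protocols.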

Comments project page: https://zhenyangcs.github.io/StreamMA-website/

Abstract

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
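The latency argument can be made concrete with the textbook pipeline model. The sketch below is an assumption for illustration, not the paper's closed-form analysis: every step takes the same time, and in the streamed case each agent starts as soon as the first step of its upstream agent arrives.

```python
def serial_latency(n_agents, steps_per_agent, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return n_agents * steps_per_agent * step_time

def streamed_latency(n_agents, steps_per_agent, step_time=1.0):
    """Idealized pipeline: agent i starts one step after agent i-1 starts."""
    return (steps_per_agent + n_agents - 1) * step_time

def speedup(n_agents, steps_per_agent):
    """Ratio of serial to streamed latency; always below n_agents."""
    return serial_latency(n_agents, steps_per_agent) / streamed_latency(
        n_agents, steps_per_agent
    )

ratio_short = speedup(n_agents=4, steps_per_agent=10)
ratio_long = speedup(n_agents=4, steps_per_agent=100)
```

Under this model the speedup approaches the number of agents as the number of steps per agent grows, which is at least consistent with the abstract's remark that more per-agent steps improve efficiency.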


2606.05145 2026-06-04 cs.LG cs.AI cs.CL (updated version)

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar, Eilif B. Muller

Institutions: Mila - Quebec AI Institute; Université de Montréal; Polytechnique Montréal; CHU Sainte-Justine

AI summary: Proposes identifying fixable failures from the distributional signature of failed reasoning traces rather than from their text, and designs a training-free routing rule that improves the effectiveness of test-time interventions.

Abstract

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods (84.3 ± 4.3% accuracy, +20% over a majority-class baseline), and support a training-free routing rule that lifts rescue by +12.2% on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.


2606.05134 2026-06-04 cs.CL cs.LG (updated version)

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

Yaseen M. Osman, Geoff V. Merrett, Stuart E. Middleton

Institutions: School of Electronics and Computer Science (ECS), University of Southampton

AI summary: Studies MLP activation-based deep active learning methods for in-context learning and finds that activation signals correlate only weakly with example quality or task performance, indicating that such methods are unsuitable for in-context learning.

Comments 9 pages, 3 figures

Abstract

Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.
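The negative result is stated in terms of the absolute Spearman correlation coefficient. As a reminder of what that statistic measures, here is a small NumPy sketch; the activation scores and accuracies are made-up numbers, not data from the paper, and the helper assumes inputs without ties.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for inputs without ties."""
    # argsort of argsort turns values into 0-based ranks.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Spearman's rho is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical per-example activation signal and downstream task accuracy.
activation_signal = [0.1, 0.4, 0.2, 0.9, 0.5]
task_accuracy = [0.60, 0.55, 0.70, 0.65, 0.50]
rho = spearman_rho(activation_signal, task_accuracy)
```

A value of `abs(rho)` near 1 would mean the activation signal orders examples almost exactly as task performance does; values at or below 0.33, as reported above, indicate a weak monotonic relationship.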


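2606.05122 2026-06-04 cs.CL (updated version)

Evaluating the Dynamic Clinical Decision-Making of Large Language Models on Standardized Patient Cases

Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

Institutions: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary: Introduces the MedSP1000 benchmark, which simulates dynamic clinical interaction through standardized patient cases to evaluate LLMs on information gathering, treatment planning, and longitudinal management, and finds that current models fall well short of clinical safety standards under process-level evaluation.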

Abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.


2606.05106 2026-06-04 cs.CL cs.AI cs.CY (updated version)

Arithmetic Pedagogy for Language Models

Andhika Bernard Lumbantobing, Hokky Situngkir

Institutions: Bandung Fe Institute & Adjunct Science Fellow in InaAI; AI Research Center IT Del & Bandung Fe Institute

AI summary: Drawing on human mathematics pedagogy, operationalizes the GASING method as chain-of-thought supervision to train a small GPT-2 model, which reaches high accuracy on arithmetic reasoning and exhibits an associative mental-arithmetic capacity.

Comments 18 pages, 6 figures

Abstract

We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method, an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation, we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses (attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection) show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.


2606.05087 2026-06-04 cs.CL (updated version)

Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models

Francesca Franzon, Nicolas Rosàs Gómez, Leo Wanner

Institutions: Universitat Pompeu Fabra (UPF)

AI summary: Builds a minimal-pair dataset to probe whether language models can distinguish light-verb from full-verb uses, finding that models separate the two uses in minimal contexts and show dissociable patterns across object types.

2606.05085 2026-06-04 cs.CL cs.AI (updated version)

Automatic Generation of Titles for Research Papers Using Language Models

Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay

Institutions: Jadavpur University; Indian Association for the Cultivation of Science

AI summary: Proposes generating paper titles from abstracts with pre-trained language models and large language models; a fine-tuned PEGASUS-large achieves the best performance on multiple datasets.

Comments 24 pages, 24 tables, 01 figure


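Endowing Large Language Models with Bidirectional Logic for Robust Chain Repair

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

Institutions: Department of Computer Science, University of Oxford, UK; FLock.io; Institute of Logic and Computation, TU Wien, Austria

AI summary: To address error snowballing in autoregressive chain-of-thought reasoning, proposes the Teleological Reasoning Infilling (TRI) framework, which recasts erroneous reasoning segments as fill-in-the-middle tasks, introduces a Prefix-Suffix-Middle sequence rearrangement, and combines supervised fine-tuning on symbolically verified data with direct preference optimization, so that only the damaged segment of a chain is repaired.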

Comments 25 Pages


Journal ref: In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2026

Abstract

Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
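The Prefix-Suffix-Middle rearrangement can be pictured as a simple change of token order. The sketch below is an illustration under assumptions: the sentinel strings and the position of the query $Q$ at the front are hypothetical, since the abstract only says that three non-overlapping sentinel tokens are used.

```python
# Hypothetical sentinel tokens; the abstract does not name the real ones.
PRE, SUF, MID = "<|prefix|>", "<|suffix|>", "<|middle|>"

def build_psm_example(query, prefix, suffix, middle=None):
    """Lay out a fill-in-the-middle example in Prefix-Suffix-Middle order."""
    # P and S both precede the middle marker, so a causal model generating M
    # can attend to the verified prefix and the verified milestone.
    prompt = f"{query}{PRE}{prefix}{SUF}{suffix}{MID}"
    # Training: append the bridge M as the target. Inference: leave it open.
    return prompt if middle is None else prompt + middle

train_text = build_psm_example("Q: 12*11?", "12*10=120.", "So 132.", "120+12=132.")
infer_text = build_psm_example("Q: 12*11?", "12*10=120.", "So 132.")
```

Because the suffix is moved ahead of the middle in the sequence, ordinary left-to-right attention suffices and no change to the attention mask is needed, which matches the claim in the abstract.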


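2606.05029 2026-06-04 cs.LG cs.CL (updated version)

Validity Threats for Foundation Model Research

Gunnar König, Martin Pawelczyk, Ulrike von Luxburg, Sebastian Bordt

Institutions: University of Tübingen, Tübingen AI Center; University of Vienna

AI summary: Proposes an evaluation framework grounded in causal inference that maps the approximate experimental strategies used in foundation model research (proxy experiments, observational studies, single-run designs) onto trade-offs among four types of validity (statistical, internal, external, construct), exposing and analyzing the hidden validity threats that come with compute savings.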

Abstract

Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats: hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.


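2606.05016 2026-06-04 cs.CL (updated version)

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

Huy Quoc To, Fuyi Li, Guangyan Huang, Ming Liu

Institutions: Deakin University; Adelaide University

AI summary: To address the depth-dependent asymmetry between task and domain LoRA adapters during merging, proposes TaDA, a training-free algorithm that uses calibrated probe-guided per-layer gating and per-component subspace-aware merging, and that outperforms merging baselines on average across six scientific QA and six image classification benchmarks.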

Abstract

Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose TaDA (Task-Domain LoRA Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal proved invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. TaDA produces a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by +3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.


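2606.05009 2026-06-04 cs.CL cs.AI (updated version)

DAR: Deontic Reasoning with Agentic Harnesses

Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

Institutions: Johns Hopkins University; Télécom Paris, Institut Polytechnique de Paris

AI summary: Proposes the DAR framework, which improves LLM-based deontic reasoning by letting the model interact with statutes on demand; experiments show that agentic harnesses can improve performance, though the gains are non-uniform and weaker models degrade on numerical tasks.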

2606.05008 2026-06-04 cs.CV cs.AI cs.CL (updated version)

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

Institutions: School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University; Yuanpei College, Peking University; Institute for Artificial Intelligence, Peking University; School of Psychological and Cognitive Sciences, Peking University; University of Wisconsin-Madison

AI summary: Proposes M$^3$Eval, the first memory evaluation framework for multi-modal models, which uses video tasks designed around cognitive psychology to systematically assess memory retention, fidelity, and robustness, and finds marked deficits in handling parallel video streams, interference patterns, spatio-temporal memory, and symbolic memory.

Comments We present an evaluation designed for multi-modal memory in multi-modal models

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.


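2606.05002 2026-06-04 cs.CL (updated version)

GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu

Institutions: Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the GARL framework, which formalises multi-agent strategic prioritisation as a two-stage game and converts game-theoretic utilities into role-specific reinforcement signals to optimise interaction policies; on issues-in-dispute ranking it improves performance and makes small open-source LLMs competitive with a strong closed-source LLM.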

Abstract

LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.


2606.04987 2026-06-04 cs.CL cs.AI cs.HC (updated version)

DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos

Institutions: University of Cambridge; University of Sheffield

AI summary: Introduces the DeliChess dataset of multi-party dialogues in which groups collaboratively solve chess puzzles; discussion significantly improves group accuracy, and the role of probing utterances is analysed.

Abstract

Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.


2606.04978 2026-06-04 cs.CL cs.CY econ.GN q-fin.EC version update

Clinical Remote Engagement Assistant (CARE-link): A Web-Based Electronic Health Record Software for Managing Diabetes

Prince Ebenezer Adjei, Joshua Teye Tettey, Toufiq Musah, Audrey Agbeve, John Amuasi

Affiliations: Global One Health Research Group, Bernhard Nocht Institute of Tropical Medicine; Global Health and Infectious Diseases Research Group, Kumasi Centre for Collaborative Research in Tropical Medicine; Department of Computer Engineering, Kwame Nkrumah University of Science and Technology; Department of Global Health, School of Public Health, Kwame Nkrumah University of Science and Technology

AI summary: CARE-link is an open-source, web-based clinical support platform that connects clinicians and patients through LLM-mediated workflows to improve the management of gestational diabetes. The system aggregates patient-generated data from outside the hospital, provides clinical decision support, and gives patients explanations of their management plans and lifestyle guidance through a WhatsApp interface.

2606.04928 2026-06-04 cs.LG cs.CL version update

Data Attribution in Large Language Models via Bidirectional Gradient Optimization

Frédéric Berdoz, Luca A. Lanzendörfer, Kaan Bayraktar, Roger Wattenhofer

Affiliations: EPFL, Switzerland; ETH Zurich, Switzerland

AI summary: Proposes a training data attribution method for autoregressive large language models based on bidirectional gradient optimization, identifying the training data that most influences model outputs and improving model interpretability.

Comments Presented at the AI Governance (AIGOV) Workshop at AAAI 2026

Abstract

Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
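The perturb-then-measure idea in this abstract can be illustrated on a toy one-parameter regression model. This is only a sketch under stated assumptions: the squared-error model, the learning rate, and the scoring rule (loss after ascent minus loss after descent) are hypothetical choices made for illustration, since the abstract does not say how the two directions are combined.

```python
def loss(w, x, y):
    """Squared error of a one-parameter linear model (toy stand-in for an LLM loss)."""
    return (w * x - y) ** 2


def grad(w, x, y):
    """Derivative of the squared error with respect to w."""
    return 2.0 * (w * x - y) * x


def bidirectional_scores(w, generated, train_set, lr=0.1):
    """Score each training sample by how its loss reacts to both perturbations."""
    g = grad(w, *generated)
    w_descent = w - lr * g  # model nudged toward the generated sample
    w_ascent = w + lr * g   # model nudged away from the generated sample
    # Assumed scoring rule: a sample is influential if ascent hurts it and descent helps it.
    return [loss(w_ascent, x, y) - loss(w_descent, x, y) for x, y in train_set]


# Hypothetical data: the first training sample agrees with the generated one, the second conflicts.
scores = bidirectional_scores(1.0, (1.0, 2.0), [(1.0, 2.0), (1.0, 0.0)])
```

In this toy setting the training sample that matches the generated output receives a positive score and the conflicting one a negative score.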


2606.04924 2026-06-04 cs.CL version update

Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection

Aswathy Velutharambath, Neele Falk, Sofie Labat, Tarun Tater, Amelie Wuehrl

Affiliations: University of Stuttgart; Ghent University; Harvard University; IT University of Copenhagen

AI summary: Surveys 155 researchers in NLP and related fields on how LLMs challenge the validity of crowdsourced data, covering detection and mitigation strategies; 44% of respondents observed LLM usage, and existing efforts remain insufficient.

Abstract

The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.


2606.04923 2026-06-04 cs.LG cs.AI cs.CL version update

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang

Affiliations: Tsinghua University; Harbin Institute of Technology, Shenzhen; Xi’an Jiaotong University

AI summary: Introduces CHERRL, a controllable hacking environment that reproduces reward hacking by injecting known biases, analyses judge biases in terms of discoverability and exploitability, and explores agent-based automatic detection.

Comments 23 pages, 7 figures

Abstract

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.


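2606.04915 2026-06-04 cs.CL cs.IR version update

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Zhenyu Yu, Shuigeng Zhou

Affiliations: Fudan University

AI summary: Using a lexical anonymization perturbation, shows that LLM performance on causal reasoning benchmarks relies mainly on lexical pattern matching rather than structural causal reasoning.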

Abstract

Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
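The kind of perturbation described here, replacing semantic variable names with placeholder tokens while leaving the rest of the question intact, might look roughly like the sketch below. The function name, the `X1, X2, ...` placeholder scheme, and the example question are hypothetical; the abstract does not specify how Caliper chooses its placeholders.

```python
import re


def anonymize_variables(question, variable_names):
    """Replace each named variable with a placeholder such as X1, X2, ..."""
    if not variable_names:
        return question, {}
    # Placeholder scheme (X1, X2, ...) is a hypothetical choice for illustration.
    mapping = {name.lower(): f"X{i}" for i, name in enumerate(variable_names, start=1)}
    # Try longer names first so multi-word names win over shorter overlapping ones.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in alternatives) + r")\b",
        re.IGNORECASE,
    )
    anonymized = pattern.sub(lambda m: mapping[m.group(0).lower()], question)
    return anonymized, mapping


# Hypothetical question; the causal chain stays the same after anonymization.
question = "Smoking causes tar deposit. Tar deposit causes lung cancer. Does smoking affect lung cancer?"
anonymized, mapping = anonymize_variables(question, ["smoking", "tar deposit", "lung cancer"])
```

Comparing a model's accuracy on `question` and on `anonymized` gives the kind of gap the abstract reports, since only the lexical anchors differ between the two.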


2606.04911 2026-06-04 cs.CV cs.CL version update

BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia

Affiliations: DAMO Academy, Alibaba Group; Zhejiang University; Hupan Lab; West China Hospital; China Medical University

AI summary: Proposes BreastGPT, a multimodal large language model that, building on the workflow-aligned instruction corpus BreastStage and a dual-branch visual encoder, performs multimodal reasoning across breast cancer screening, diagnosis, and treatment planning, reaching 75.66% closed-ended accuracy and an 89.92% open-ended score on BreastStage-Bench.

Abstract

Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.


2606.04909 2026-06-04 cs.IR cs.CL version update

BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration

Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen

Affiliations: National Taiwan University; Rakuten Group, Inc.; Taiwan Rakuten Ichiba, Inc.; Rakuten Asia Pte. Ltd.

AI summary: To address the lack of structured attribute schemas on e-commerce platforms in emerging markets, proposes the BEATS framework, which uses a human-AI collaborative LLM pipeline to build a product attribute taxonomy from scratch and improves search system performance through attribute annotation.

Comments 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: https://doi.org/10.1145/3805712.3808520

DOI: 10.1145/3805712.3808520

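A French corpus annotated with adverbial multiword expressions

Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

Affiliations: Université Paris-Est; Institut Gaspard-Monge - LabInfo

AI summary: Presents a French corpus annotated with adverbial multiword expressions, intended to support research on information retrieval, information extraction, and deep and shallow syntactic parsing.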

2606.04823 2026-06-04 cs.AI cs.CL cs.MA version update

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne; Honda Research Institute Europe; Department of Computer Science, University of Manchester; National Biomarker Centre, CRUK-MI, University of Manchester

AI summary: Proposes R-APS, which combines reasoning-mode decomposition, staged compositional reasoning, sensitivity-guided adversarial testing, and meta-inductive rule extraction to jointly address error propagation, worst-case perturbations, and stale knowledge in agentic LLM settings, yielding tighter robustness certificates and faster iterations on planar mechanism synthesis.

Abstract

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.
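The Chamfer distance mentioned in this abstract compares two point sets, such as a target trajectory and the curve traced by a candidate mechanism. The sketch below uses one common convention (mean nearest-neighbour Euclidean distance, summed over both directions); conventions differ between papers (squared distances, sums instead of means), and the abstract does not say which one R-APS uses, so treat the exact form as an assumption. The point sets are hypothetical.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets."""

    def mean_nearest(src, dst):
        # Average distance from each point in src to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Hypothetical target trajectory and a candidate curve shifted up by one unit.
target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidate = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
distance = chamfer_distance(target, candidate)
```

A lower value means the candidate curve follows the target more closely; identical point sets give zero.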


2606.04807 2026-06-04 cs.AI cs.CL cs.CY cs.LG version update

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Saket Reddy, Ke Yang, ChengXiang Zhai

Affiliations: University of Illinois - Urbana-Champaign

AI summary: Proposes the BiasGRPO framework, which uses Group Relative Policy Optimization (GRPO) to stabilize social bias mitigation in large language models by normalizing rewards within a group, outperforming DPO and PPO.

Comments Accepted to Findings of the ACL

Abstract

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
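The group-relative baseline described in this abstract can be sketched as follows. This uses the mean and standard deviation normalization commonly associated with GRPO; whether BiasGRPO uses exactly this form is an assumption, and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical bias-reward scores for four completions sampled from the same prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Because each completion is scored relative to its own group, no learned value function (critic) is needed, which is the substitution the abstract describes.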


2606.04780 2026-06-04 cs.CL version update

PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents

Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin

Affiliations: School of ASEE, Beihang University, Beijing, China; The University of Hong Kong, Hong Kong, China; Peking University, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of BME, Beihang University, Beijing, China; CAIR and CECS, VinUniversity, Hanoi, Vietnam

AI summary: Proposes PersonaTree, a structured lifecycle memory framework that abstracts interaction evidence into person understanding through a three-level persona tree with explicit support paths, achieving leading performance on multiple benchmarks.

Abstract

Persistent LLM agents require memory representations that make the formation of person understanding explicit across long-term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidence is abstracted into person understanding. We view this process as schema formation, where situated evidence is abstracted into reusable patterns and stable person-level claims. We introduce PersonaTree, a structured lifecycle memory framework that realizes this view as a three-level persona tree with explicit support paths from evidence to claims. PersonaTree maintains the tree through conservative writing, confidence-guided consolidation, and query-conditioned path retrieval, returning only the evidence depth required by each query. Across six person-understanding and persistent-memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that hierarchy improves abstract person understanding on KnowMe, while support-path retrieval improves RealPref alignment under a comparable context budget.


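2606.04778 2026-06-04 cs.AI cs.CL cs.LG version update

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Kyungmin Park, Taesup Kim

Affiliations: Hankuk University of Foreign Studies; Seoul National University

AI summary: Shows that safety-aligned large language models have a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior, and proposes improving robustness by aligning models directly on generation trajectories.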

Abstract

Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.


2606.04773 2026-06-04 cs.CV cs.CL version update

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger

Affiliations: University of Tübingen; Tübingen AI Center; Max Planck Institute for Informatics; Saarland Informatics Campus

AI summary: Proposes the NextMotionQA benchmark, which uses three complementary tasks and multi-granularity difficulty stratification to systematically evaluate vision-language models on human motion understanding, and reveals their limitations in fine-grained judgment.

Comments 23 pages, 8 figures, 9 tables

Abstract

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
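For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the ratings are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments from a VLM judge and a human expert on six motions.
vlm = ["good", "good", "bad", "bad", "good", "bad"]
expert = ["good", "good", "bad", "good", "good", "bad"]
kappa = cohens_kappa(vlm, expert)
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance, which is how the gap between 0.70 and 0.10 in the abstract should be read.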


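2606.04743 2026-06-04 cs.CL cs.AI cs.LG version update

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang

Affiliations: KAIST; DeepAuto.ai

AI summary: Proposes the TIDE framework, which uses template-guided iteration to proactively discover multiple hidden problems in a user's context and propose concrete actions, substantially improving task coverage, problem identification, and resolution in two settings: personal workspaces and software repositories.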

Abstract

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.


2606.04730 2026-06-04 cs.CL eess.AS version update

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

Affiliations: Karlsruhe Institute of Technology; Carnegie Mellon University

AI summary: Proposes a general data augmentation pipeline that converts short-form speech corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, and combines likelihood with Minimum Bayes Risk decoding to address degradation on semantic tasks for long-form speech.

Comments 9 pages main paper, IWSLT 2026 Instruction Following track

Abstract

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
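Minimum Bayes Risk (MBR) decoding, mentioned at the end of this abstract, picks the candidate that agrees most with the other candidates instead of the one with the highest model likelihood. The sketch below shows plain MBR with a simple token-overlap F1 as the utility; the utility function and the candidate strings are hypothetical, and the way the submission combines MBR with likelihood is not described in the abstract and is not modelled here.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Token-overlap F1 between two whitespace-tokenized strings."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest average utility against the others."""

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], other) for other in others) / max(len(others), 1)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


# Hypothetical summaries sampled for the same long-form audio.
candidates = [
    "the talk covers speech translation",
    "the talk covers machine translation",
    "the talk covers speech recognition",
    "a cat sat on the mat",
]
selected = mbr_select(candidates)
```

The outlier candidate scores poorly against every other hypothesis, so it is never selected even if a model assigned it a high likelihood.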


2606.04719 2026-06-04 cs.CL version update

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

Affiliations: Korea Advanced Institute of Science and Technology / Korea, Republic of; University of Illinois in Urbana-Champaign / United States of America; Korea University / Korea, Republic of

AI summary: Proposes a query-based cross-modal projector that compresses visual tokens through cross-attention, removing the need to manually design a 2D scan order and improving the performance and throughput of Mamba-based multimodal LLMs.

Comments Accepted to EMNLP 2024 Findings

DOI: 10.18653/v1/2024.findings-emnlp.827

Abstract

The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.
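The compression step described here, a small set of queries attending over many visual tokens, can be illustrated with plain scaled dot-product cross-attention. This is a minimal sketch, assuming fixed query vectors and no learned projection matrices; the paper's projector conditions on the input and is trained, neither of which is modelled here. All vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries, tokens):
    """Compress len(tokens) visual tokens into len(queries) output vectors."""
    dim = len(tokens[0])
    pooled = []
    for query in queries:
        # Scaled dot-product score between this query and every visual token.
        scores = [sum(q * t for q, t in zip(query, token)) / math.sqrt(dim) for token in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of the visual tokens.
        pooled.append([sum(w * token[k] for w, token in zip(weights, tokens)) for k in range(dim)])
    return pooled


# Hypothetical example: four 2-d visual tokens compressed into two output tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0]]
compressed = cross_attention_pool(queries, visual_tokens)
```

The output length depends only on the number of queries, and the attention weights do not depend on the order in which the visual tokens are listed, which is one way to see why a hand-designed 2D scan order becomes unnecessary.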


2606.04703 2026-06-04 cs.CL cs.LG version update

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; School of Software, Beihang University; Meituan

AI summary: Examining experience granularity, injection pattern, and internalization regime, proposes a stable and sustainable experience internalization approach that addresses capability collapse in multi-iteration experience learning.

Comments 10 pages, 8 figures

Abstract

Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.


2606.04701 2026-06-04 cs.CV cs.CL 版本更新

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

在短视频平台上对原生动态屏幕GUI代理的基准测试

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology（北京理工大学）； Tsinghua University（清华大学）； Beihang University（北京航空航天大学）

AI总结针对短视频平台等动态屏幕环境，提出LivingScreen基准测试，通过三级任务套件和联合评估准确性与信息效率的指标，发现现有GUI代理存在观察过度或不足的问题。

Comments preprint

2606.04691 2026-06-04 cs.CL 版本更新

SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

SMADE-IE: 基于证据驱动辩论的稀疏多智能体框架用于零样本信息抽取

Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan

发表机构 * School of Software Engineering, South China University of Technology（华南理工大学软件学院）

AI总结提出SMADE-IE稀疏多智能体框架，通过自适应模式选择器和证据驱动辩论机制，在零样本信息抽取中减少冗余交互并提升性能。

Comments 21 pages, 9 figures

详情

AI中文摘要

基于大型语言模型的零样本信息抽取因其无需任务特定训练即可适应新模式和领域的灵活性而受到越来越多的关注。现有方法主要依赖于整体提示、逐类型提示或多智能体辩论。然而，整体提示常常遭受边界和类型错误，而逐类型提示和多智能体辩论引入了跨类型冲突、冗余智能体交互和大量令牌开销。为了解决这些挑战，我们提出了SMADE-IE，一种用于零样本信息抽取的稀疏且证据驱动的多智能体框架。SMADE-IE首先采用自适应模式选择器将输入动态路由到轻量级全局抽取模式或类型中心抽取模式，减少不必要的类型选择和推理噪声。对于冲突预测，我们进一步引入了证据驱动辩论机制，将论证结构化为图尔敏式组件，并通过外部证据评分和贝叶斯更新进行置信度聚合。在NER、RE和JERE任务的9个基准数据集上的实验结果表明，SMADE-IE在持续优于现有零样本信息抽取基线的同时，通过稀疏智能体选择和早期停止辩论提高了令牌效率。

英文摘要

Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.

2606.04680 2026-06-04 eess.AS cs.CL cs.SD 版本更新

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

读你所听：基于声学差异的无参考假设评估

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China（上海交通大学计算机科学学院X-LANCE实验室，中国）； MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, China（教育部人工智能重点实验室，江苏省语言计算重点实验室，中国）

AI总结提出READ指标，利用预训练自回归TTS模型计算语音与文本假设的声学差异，无需参考转录即可评估ASR假设，并在噪声条件下实现高达20%的相对错误率降低。

Comments Submitted to Interspeech 2026. 6 pages, 4 figures

2606.04661 2026-06-04 cs.CL cs.LG 版本更新

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

CRAFT: 成本感知的提示精炼与前沿感知的调优

Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta

发表机构 * MBZUAI ； Microsoft（微软）

AI总结提出CRAFT方法，通过帕累托前沿优化提示的准确性和成本，避免标量化崩溃，在多个基准上实现更广泛的准确-成本权衡。

详情

AI中文摘要

为准确性调优的提示往往越来越长，使每次模型调用的推理成本上升。最佳的准确性-成本权衡取决于任务和预算，因此提示优化应当是在准确性与提示令牌成本构成的帕累托前沿上进行搜索，而不是寻找单个提示。常见的捷径是将多个目标折叠成加权和，这会在搜索前就固定权衡权重，往往只能恢复前沿的一小段区域，我们称这种失败为标量化崩溃。我们提出了CRAFT（成本感知的精炼与前沿感知的调优），一种帕累托前沿提示优化器，它将目标LLM的验证调用视为稀缺资源，并将其分配给靠近乐观候选前沿的候选提示。每一轮中，面向准确性和面向成本的两类互补生成器提出编辑，帕累托差距采集策略分配该轮的验证预算，NSGA-II保留策略则维持一个分布分散的种群。在六个分类和推理基准上，CRAFT保留的前沿同时覆盖高准确性和低成本区域，而仅准确性、仅成本和加权和基线各自集中在更窄的区域。准确性-成本权衡由此成为搜索后的选择，而不是搜索前设定的权重。

英文摘要

Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
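
The contrast the abstract draws between a Pareto front and a weighted sum can be illustrated with a small sketch. This is not CRAFT itself (there is no generator, Pareto-gap acquisition, or NSGA-II retention here); the candidate names, accuracy values, token costs, and the scalarization formula are all made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical prompt label
    accuracy: float  # higher is better
    cost: int        # prompt tokens, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b: no worse on both objectives, strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def weighted_sum_best(cands: list[Candidate], w: float, max_cost: int) -> Candidate:
    """Assumed scalarization: w * accuracy - (1 - w) * normalized cost."""
    return max(cands, key=lambda c: w * c.accuracy - (1 - w) * c.cost / max_cost)


# Illustrative numbers only, not results from the paper.
candidates = [
    Candidate("short", 0.70, 50),
    Candidate("medium", 0.80, 200),
    Candidate("long", 0.90, 800),
    Candidate("bloated", 0.85, 1000),  # dominated by "long"
]

front = pareto_front(candidates)
single = weighted_sum_best(candidates, w=0.5, max_cost=1000)
print([c.name for c in front], single.name)
```

The front keeps three trade-off points, while a weight fixed before the search returns exactly one of them; which one depends entirely on the chosen `w`.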

2606.04660 2026-06-04 cs.CL 版本更新

LifeSide: Benchmarking Agents as Lifelong Digital Companions

LifeSide: 将智能体作为终身数字伴侣的基准测试

Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Hong Kong University of Science and Technology（香港科技大学）； Tencent（腾讯）

AI总结针对现有评估无法捕捉终身数字伴侣所需的多会话记忆、用户理解和隐私适应能力的问题，提出LifeSide基准，通过多智能体模拟构建记忆-情感-环境循环，评估模型在记忆追踪、用户理解、隐私控制和情感陪伴方面的表现，发现即使当前记忆基准饱和的模型也无法在长期内维持准确的用户理解和真正的陪伴。

Comments 28 pages, 23 figures, 7 tables

详情

AI中文摘要

终身数字伴侣必须整合跨会话线索，持续更新对用户的理解，并适应不断变化的隐私边界。现有评估未能捕捉到这一点，而是孤立地测试记忆回忆和短期共情。为了弥补这一差距，我们引入了LifeSide，一个以多会话“记忆-情感-环境”循环为中心的基准。通过将用户建模为具有分层档案和事件轨迹的持久世界，LifeSide使用多智能体模拟将环境动态投射到对话中，保留了潜在思想与可观察表达之间的关键差距。在记忆追踪、用户理解、隐私控制和情感陪伴方面评估了2,000个角色和111K个任务，我们的实验结果揭示了一个严峻的现实：即使是在当前记忆基准上饱和的模型，也无法在长期内维持准确的用户理解和真正的陪伴。

英文摘要

Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we introduce LifeSide, a benchmark centered on multi-session Memory-Emotion-Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, LifeSide uses multi-agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experimental results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.

2606.04646 2026-06-04 cs.CL cs.AI cs.IR 版本更新

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

QO-Bench: 诊断类型化事件元组上的查询操作符保持检索

Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang

发表机构 * Asian Institute of Digital Finance, National University of Singapore（亚洲数字金融研究所，新加坡国立大学）

AI总结提出QO-Bench基准，通过类型化事件元组上的确定性评估，诊断检索增强生成系统在查询操作符（如连接、交集）上的执行瓶颈。

Comments 14 pages

详情

AI中文摘要

许多关于商业、法律和科学语料库的现实世界问题是文本中潜在记录的数据库风格查询的自然语言版本。现有的检索增强生成（RAG）系统主要针对语义相关性进行优化，但检索到看似相关的段落并不能保证正确的查询执行。我们引入了QO-Bench，一个用于类型化事件元组上查询操作符问答的诊断基准。该基准涵盖22,984篇新闻文章和614个公司事件，涉及18个查询模板，在785个问题上进行评估。每个黄金答案由类型化事件元组确定性计算得出，并通过召回率评分，答案通过精确匹配而非LLM评判器与黄金元组匹配。这种设计支持操作符级别的诊断，如连接和交集。我们在匹配条件下评估了RAG、ReAct RAG、GraphRAG和信息提取到SQL的方法，并设置了一个长上下文oracle上限以隔离检索失败。一个双轴框架——索引时保持与查询时执行——预测了每种范式失败的位置，结果证实了这一点：系统检索到相关文本，但丢弃了操作符所需的类型化值，并且可部署的范式排名在不同操作符间反转，相似性检索在过滤/投影上领先，而提取到SQL在交集和计数上领先。即使提供了黄金证据，长上下文oracle也远未饱和，因此操作符执行——而不仅仅是检索——是一个核心瓶颈，更强的答案模型也无法消除。QO-Bench将目标从段落相关性重新定义为查询操作符保持检索。

英文摘要

Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
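
The scoring rule described above (recall against gold tuples, matched exactly rather than by an LLM judge) can be sketched as follows. The tuple layout `(company, event_type, date)` and the example records are hypothetical; the benchmark's actual schema and its handling of empty gold sets are not given in the abstract.

```python
# A typed event tuple, assumed here to be (company, event_type, ISO date).
EventTuple = tuple[str, str, str]


def tuple_recall(predicted: list[EventTuple], gold: list[EventTuple]) -> float:
    """Fraction of gold tuples that appear verbatim in the prediction.

    Matching is exact: a tuple counts only if every field is identical.
    """
    gold_set = set(gold)
    if not gold_set:
        # Assumption of this sketch: recall is undefined without gold tuples.
        raise ValueError("gold must contain at least one tuple")
    return len(gold_set & set(predicted)) / len(gold_set)


# Made-up records for illustration.
gold = [
    ("AcmeCorp", "acquisition", "2024-03-01"),
    ("Globex", "acquisition", "2024-05-12"),
]
predicted = [
    ("AcmeCorp", "acquisition", "2024-03-01"),
    ("Globex", "merger", "2024-05-12"),  # plausible text, wrong typed value
]

score = tuple_recall(predicted, gold)
print(score)
```

The second prediction names the right company and date but the wrong event type, so it earns no credit. That mirrors the failure the abstract highlights: relevant text is retrieved, yet the typed value an operator needs is lost.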

2606.04645 2026-06-04 cs.CL cs.DB 版本更新

CYGNET: Cypher Gate for Neural Execution Triage and Cost Containment

CYGNET: 用于神经执行分诊与成本控制的Cypher门控

Nikodem Tomczak

发表机构 * Thulge Labs, Singapore（新加坡Thulge实验室）

AI总结提出CYGNET门控机制，通过预执行验证和错误修正，在保证生成准确率的同时，高效拦截结构错误的Cypher查询并标记成本过高的执行计划。

详情

AI中文摘要

作为知识图谱代理的语言模型生成的Cypher查询可能因结构错误（在数据库中崩溃）或语义错误（可以执行但返回错误结果）而失败。我们在查询生成与生产级Neo4j数据库之间设置了一个预执行门。该门通过一条由四个后端组成的链验证查询结构，最后一步是在镜像图上执行，中位延迟为5.6毫秒。结构错误的查询被路由到一个修正器，该修正器将结构化的错误反馈交给语言模型迭代修正。在七个CypherBench模式（2348个问题，ACL 2025）上，该流水线在所有测试模型上保持了生成准确率，证实其可作为安全的防御层运行。修正器在五个模型上的成功率为81%至95%（平均89%）。在覆盖九个模式的模板生成语料库上，该门捕获了100%的解析错误、100%的约束违规，以及带标签端点的路径查询中100%的模式引用错误，在1135个查询中零误报。对于属性兄弟交换（替换后的名称在目标标签上仍然有效），检出率为0%，这标志着结构验证结束、语义验证必须开始的正式边界。基于规划器的成本门在执行前标记灾难性的计划结构。

英文摘要

Language models acting as agents over knowledge graphs generate Cypher queries that fail structurally (crashing at the database) or semantically (executing but returning wrong results). We place a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain culminating in execution against a mirror graph at 5.6 ms median latency. Structurally broken queries are routed to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025) the pipeline maintains generation accuracy on every model tested, confirming it operates as a safe defensive layer. The corrector achieves 81% to 95% success across five models (mean 89%). On a template-generated corpus across nine schemas the gate catches 100% of parse errors, 100% of constraint violations, and 100% of schema-reference errors in path queries with labelled endpoints, at zero false positives across 1135 queries. Property sibling-swaps where the substituted name is valid on the target label score 0%, marking the formal boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.

2606.04632 2026-06-04 cs.LG cs.CL 版本更新

VentAgent: When LLMs Learn to Breathe -- Multi-Objective Arbitration for ARDS Ventilation

VentAgent：当大语言模型学会呼吸——ARDS通气的多目标仲裁

Teqi Hao, Yuxuan Fu, Xiaoyu Tan, Shaojie Shi, Bohao Lv, Yinghui Xu, Xihe Qiu

发表机构 * School of Electronic and Electrical Engineering, Shanghai University of Engineering Science（上海工程技术大学电子与电气工程学院）； Tencent Youtu Lab（腾讯优图实验室）； Artificial Intelligence Innovation and Incubation Institute, Fudan University（复旦大学人工智能创新与孵化院）

AI总结提出VentAgent分层框架，利用大语言模型作为透明仲裁者，通过感知-规划-编排三阶段将机械通气控制转化为动态多目标仲裁过程，在生理模拟器上优于强化学习和经典控制基线，并提供可解释的推理链。

详情

AI中文摘要

急性呼吸窘迫综合征（ARDS）的机械通气需要平衡竞争性的生理目标，包括氧合、肺保护和酸碱平衡。然而，当前的数据驱动方法，尤其是模仿回顾性电子健康记录（EHR）的方法，常常遭受模仿偏差。它们可能从不一致的临床演示中捕获表面相关性，例如将被动呼吸机设置与生存关联，因为这种设置在稳定患者中很常见，因此无法泛化到不稳定或分布外的表型。标准的强化学习（RL）方法也难以处理重症监护中的对抗性权衡，并常常产生不透明且临床可解释性有限的策略。为了解决这些局限性，我们引入了VentAgent，一个分层框架，其中大语言模型（LLM）作为机械通气的透明仲裁者。我们将通气控制重新表述为动态多目标仲裁过程，而非单目标优化。VentAgent将决策分解为三个可解释的阶段：感知、规划和编排。通过利用LLM的语义推理能力，它综合来自异构专家的策略，并通过显式协调机制解决冲突的临床优先级。在高保真生理模拟器上的评估表明，VentAgent优于最先进的RL和经典控制基线。此外，它将控制决策转化为人类可读的推理链，为重症监护自动化提供了更安全、更可解释和更自适应的范式。

英文摘要

Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.

2606.04628 2026-06-04 cs.CL cs.MA 版本更新

RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation

RAMPART: 基于注册表的代理记忆与优先级感知运行时转换

Nikodem Tomczak

发表机构 * Nikodem Tomczak

AI总结提出RAMPART编译时记忆模型和纯内存块注册表，通过可编程运行时操作和五种原语实现上下文组装，实验表明块位置和分组显著影响任务成功率，并实现零提示令牌成本的共享注册表协调。

详情

AI中文摘要

RAMPART是一种用于基于LLM的代理的编译时记忆模型和纯内存块注册表。上下文组装是一种可编程的运行时操作，其中内容根据显式策略（排序、包含和驱逐）从结构化注册表中编译。五种可组合原语（提升、门控、写入、驱逐、回滚）在编译前对命名可寻址块进行操作，且零提示令牌成本。来源标签和不可驱逐的作者标志实现了具有块级所有权的许可记忆模型。使用Qwen3-8B Q4进行的受控探测表明，编译时放置以及块与任务查询之间的结构关系影响任务成功，当任务跟随注册表时，性能在约第七个块位置急剧下降，当任务先于注册表时则在第十二个位置。将关键块与内容相邻的邻居分组，并将该组作为一个单元提升，在单块放置失败的位置将任务成功率提高数十个百分点。在Qwen2.5-7B、Llama-3.1-8B、Mistral-7B-v0.3和Qwen3-14B上的跨模型复现表明，内容启动效应在不同家族中出现在相同的绝对位置，幅度随模型强度变化。块分组使Mistral在最难注册表大小下的平均通过率提高约五倍，并且在中间注册表区域，使用干预的较小模型可以超越不使用干预的较大模型。相关性门控将提示成本降低67.8%，同时恢复83%的提升条件成功率。模式驱逐产生0%的调用，而存在模式时为100%，这是基于策略的方法无法通过构造保证的属性。共享注册表协调将代理间通信减少为方法调用，且零协调令牌成本。

英文摘要

RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral's mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
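
To make the idea of a block registry with compile-time context assembly more concrete, here is a toy sketch. It is not RAMPART's API: the class and method signatures, the snapshot-based rollback, and treating `gate` as a predicate passed to `compile` are all assumptions of this example. Only the five primitive names and the non-evictable flag come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Block:
    name: str
    text: str
    evictable: bool = True  # stands in for the non-evictable authorship flag


class Registry:
    """Toy in-memory registry; dict order is the order blocks are compiled in."""

    def __init__(self) -> None:
        self._blocks: dict[str, Block] = {}
        self._history: list[dict[str, Block]] = []  # snapshots for rollback

    def _snapshot(self) -> None:
        self._history.append(dict(self._blocks))

    def write(self, name: str, text: str, evictable: bool = True) -> None:
        self._snapshot()
        self._blocks[name] = Block(name, text, evictable)

    def promote(self, name: str) -> None:
        """Move a named block to the front of the compile order."""
        self._snapshot()
        block = self._blocks.pop(name)
        self._blocks = {name: block, **self._blocks}

    def evict(self, name: str) -> None:
        if not self._blocks[name].evictable:
            raise PermissionError(f"block {name!r} is not evictable")
        self._snapshot()
        del self._blocks[name]

    def rollback(self) -> None:
        """Undo the most recent write, promote, or evict."""
        if self._history:
            self._blocks = self._history.pop()

    def compile(self, gate: Callable[[Block], bool] = lambda b: True) -> str:
        """Assemble the prompt context from blocks that pass the gate."""
        return "\n".join(b.text for b in self._blocks.values() if gate(b))


registry = Registry()
registry.write("system", "You are a helpful agent.", evictable=False)
registry.write("schema", "TOOL search(query: str)")
registry.write("notes", "User prefers short answers.")
registry.promote("notes")
print(registry.compile())
```

All registry operations happen in ordinary program memory before `compile` is called, which is the sense in which such primitives add no prompt tokens of their own.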

2606.04612 2026-06-04 cs.CL 版本更新

Hybrid Adversarial Defence for Natural Language Understanding Tasks

混合对抗防御用于自然语言理解任务

Manar Abouzaid, Yang Wang, Chenghua Lin, Stuart E. Middleton

发表机构 * School of Electronics and Computer Science, University of Southampton, UK（南安普顿大学电子与计算机科学学院）； Department of Computer Science, University of Manchester, UK（曼彻斯特大学计算机科学系）

AI总结提出一种结合熵、不确定性和几何特征的混合防御框架，在多个自然语言理解数据集上同时提升了干净任务性能和对抗鲁棒性。

详情

AI中文摘要

大型语言模型（LLMs）既容易产生幻觉，也容易受到对抗性操纵。尽管这些问题密切相关，但现有的防御方法通常分别处理它们。我们研究了一种混合防御框架，该框架结合了旨在减少幻觉的基于熵的模型，以及旨在降低脆弱性的基于不确定性的模型和基于几何的模型。在自然语言理解数据集（FEVER、HotpotQA、CSQA、SIQA）上的域内测试中，我们发现我们的混合模型提高了干净任务性能（准确率提升高达43.34%）和对抗鲁棒性（准确率提升高达64.92%，攻击成功率降低62.27%）。对于分布外数据集（AeroEngQA、CPIQA），我们的混合模型表现出类似的对抗鲁棒性（准确率提升高达57.14%）。对于提示注入（SafeGuard）和越狱检测（AdvBench、DAN）数据集，我们的混合模型也非常强大（与最先进的基线模型相比，攻击成功率降低高达51%）。总体而言，我们的结果表明，对于域内和分布外任务，结合熵、不确定性和几何特征比单独使用任何单一特征都能提供更有效的防御策略。

英文摘要

Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51% reduction in attack success rate compared to state of the art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.

2606.04596 2026-06-04 cs.CL 版本更新

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

多视频摘要中位置偏差的系统评估：基于多模态大语言模型

Huangchen Xu, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University（知识驱动人机智能工程研究中心，吉林大学）； International Center of Future Science, Jilin University（未来科学国际中心，吉林大学）

AI总结本研究系统评估了多模态大语言模型在多视频摘要任务中的位置偏差，通过构建基准和三种互补指标揭示了领域与模型依赖的偏差特性，并分析了提示级缓解方法。

详情

AI中文摘要

多模态大语言模型（MLLMs）越来越多地用于视频理解，但它们在多视频输入下的可靠性仍知之甚少。我们研究了多视频摘要中的位置偏差，即每个视频摘要的质量可能随视频输入槽位的变化而变化，即使底层内容不变。我们从ActivityNet和新闻视频构建了一个基准，涵盖烹饪、家庭、休闲和新闻场景，包含两个和四个视频输入。我们评估了九个开源和专有MLLMs，并使用三种互补指标测量位置效应：覆盖率、方向性位置偏差（DPB）和中间边缘差距（MEG）。我们的结果表明，位置效应是领域和模型依赖的：即使中间位置表现不佳，有符号的方向性偏差也可能很小；增加视觉或生成预算并不能均匀地消除不平衡。我们进一步分析了提示级缓解方法。总之，结果表明多视频摘要仍然对输入协议和位置敏感，这促使开发更鲁棒的、顺序不变的多模态系统。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

2606.04591 2026-06-04 cs.CL cs.CV 版本更新

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

多模态长对话中的细粒度片段检索

Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

发表机构 * Pattern Recognition Center, WeChat AI, Tencent Inc（腾讯公司微信AI模式识别中心）； Aerospace Information Research Institute, Chinese Academy of Sciences（中国科学院空天信息创新研究院）

AI总结提出细粒度片段检索任务，通过强化学习训练的生成式检索模型F2RVLM和两阶段系统FFRS，实现多模态长对话中多语句、多图像片段的精准定位。

详情

AI中文摘要

随着多模态交流平台的广泛采用，文本和图像交织的长对话变得越来越普遍。用户通常需要检索与特定主题相关的连贯对话片段，而不是孤立的语句。我们提出了细粒度片段检索（FFR），用于在多模态长对话中定位语义相关的多语句、多图像片段。我们探索了两种设置：（1）单对话内的FFR，从给定对话中检索片段；（2）对话语料库内的FFR，从大规模语料库中为开放域场景检索片段。对于（1），我们引入了F2RVLM，一种基于生成的检索模型，使用强化学习训练，通过多目标奖励和难度感知课程采样来增强片段连贯性。对于（2），我们开发了FFRS，一个两阶段系统，结合了离线片段级索引和在线检索。具体来说，每个对话被分解为最小语义片段，由片段嵌入模型（FEM）编码到向量数据库中；在推理时，FEM快速召回Top-K候选，F2RVLM进行细粒度推理以识别最相关的子内容。为支持FFR，我们构建了MLDR，迄今为止最长的多模态对话检索数据集，以及一个基于微信的真实世界测试集。在两个基准上的实验表明，F2RVLM和FFRS在单对话和语料库级别的FFR上始终取得优越性能。

英文摘要

With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.

2606.04588 2026-06-04 cs.CL 版本更新

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

VCIFBench：评估视频理解中的复杂指令遵循能力

Huangchen Xu, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University（知识驱动人机智能工程研究中心，吉林大学）； International Center of Future Science, Jilin University（未来科学国际中心，吉林大学）

AI总结提出VCIFBench基准，通过混合验证流水线评估多模态大模型在视频理解中遵循内容、格式、风格和结构约束的复杂指令能力，实验表明联合约束满足仍具挑战，DPO训练可提升性能。

2606.04557 2026-06-04 cs.CL cs.IR cs.LG 版本更新

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

大规模弹匣：训练模块化KV缓存以处理大型文档集合

Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert

发表机构 * Amazon AGI（亚马逊AGI）

AI总结提出Cartridges at Scale (CAS)框架，通过动态干扰混合和内存高效预算管理器实现大规模多弹匣训练，在减少预填充开销的同时保持准确性，性能优于单块弹匣10-31点，接近全上下文学习。

Comments 21 pages, 5 figures, 17 tables

详情

AI中文摘要

大型语言模型能够处理长上下文，但预填充数百万个标记是浪费的，因为许多内容在查询之间保持不变。弹匣通过将文档集合提炼为可重用的键值（KV）缓存来解决这一问题，从而消除预填充同时保持准确性。这种方法的一个关键限制是弹匣是单块且非组合的：将整个集合编码为单个KV块无法扩展，并且天真地混合单独训练的弹匣会使性能下降到接近随机水平。我们引入了Cartridges at Scale (CAS)，这是一个可扩展的多弹匣学习训练框架，具有动态干扰混合和内存高效的预算管理器，可在GPU和持久存储之间轮换数百个每文档弹匣。我们的方法可扩展到超过一百万个标记的集合，在可比标记预算下，比单块弹匣提高10-31点。即使在高度压缩下，Oracle弹匣准确率也接近完全上下文学习的2-6点范围内。当与检索结合用于弹匣选择时，CAS匹配或超过传统RAG准确率，同时消耗的提示标记减少3-4倍。

英文摘要

Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.

2606.04555 2026-06-04 cs.CL cs.AI 版本更新

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

时间顺序对智能体记忆至关重要：面向长程智能体的线段树

Yifan Simon Liu, Liam Gallagher, Faeze Moradi Kalarde, Jiazhou Liang, Armin Toroghi, Scott Sanner

发表机构 * University of Toronto（多伦多大学）； Vector Institute for Artificial Intelligence（向量人工智能研究所）

AI总结提出线段树记忆架构SegTreeMem，通过在线右边缘更新规则保持对话历史的时间顺序，结合层次化时间上下文进行检索，在长程记忆基准上优于现有方法。

详情

AI中文摘要

长程对话智能体需要在不断演化的事件、任务和目标中与用户交互。这类历史记录天然具有时间性，然而许多现有的记忆系统主要按主题相似性组织信息，可能忽略事件发生的先后顺序。我们引入线段树记忆（Segment Tree Memory，简称SegTreeMem），这是一种将对话历史表示为建立在话语之上、按时间排序的线段树的记忆架构。SegTreeMem通过在线的最右前沿（rightmost-frontier）更新规则逐步插入新话语，在形成层次化记忆片段的同时保持时间顺序。在检索时，SegTreeMem沿树传播相关性分数，将局部语义匹配与层次化时间上下文相结合。在三个长程记忆基准和两个LLM骨干模型上，SegTreeMem的答案质量优于平面检索、图结构记忆和树结构记忆基线。额外的时间顺序置换分析表明，性能提升依赖于在记忆构建过程中保持时间顺序，这支持了时间顺序是智能体记忆关键结构的观点。

英文摘要

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.

2606.04552 2026-06-04 cs.CL q-bio.GN 版本更新

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

LDARNet: 用于基因组建模的DNA自适应表示网络与可学习分词

Daria Ledneva, Denis Kuznetsov

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出LDARNet，一种结合动态分块和双向路由的120M参数层次基因组基础模型，在27个任务中优于更大模型，并发现学习到的边界与生物学基序对齐。

详情

AI中文摘要

基因组基础模型越来越多地采用大型语言模型架构，但几乎普遍依赖于固定的分词方案，如$k$-mers、BPE或单核苷酸，这些方案强加了可能掩盖生物学相关结构的任意序列边界。我们提出了LDARNet，一个120M参数的层次基因组基础模型，它将H-Net风格的动态分块从自回归生成适应到掩码语言建模，结合了BiMamba-2状态空间层与局部注意力、双向路由以及基于比值的正则化器，以在无监督的情况下诱导自适应标记边界。在来自Nucleotide Transformer和Genomic Benchmarks套件的27个任务上进行微调后，LDARNet在紧凑模型（<300M参数）中取得了11/18的胜率，并在5个组蛋白修饰任务上取得了最先进的结果，优于高达20倍大的模型。一个FLOPs匹配的对照实验将学习到的路由确定为这些增益的来源：在相同计算量下，学习到的边界在组蛋白任务上比固定网格边界高出多达14个百分点。进一步的核苷酸分辨率分析表明，学习到的边界在无监督的情况下与典型的启动子基序和剪接连接点对齐，为基因组基础模型中的自适应分词提供了生物学解释。

英文摘要

Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models (<300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20× larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.

2606.04535 2026-06-04 cs.CL cs.AI 版本更新

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

扩散大语言模型中用于格式约束生成的动态填充锚点

Boyan Han, Yiwei Wang, Yi Song, Yujun Cai, Chi Zhang

发表机构 * AGI Lab, Westlake University, China（西湖大学AGI实验室，中国）； University of California, Merced, USA（加州大学默塞德分校，美国）； Teeni AI, China（Teeni AI，中国）； The University of Queensland, Australia（昆士兰大学，澳大利亚）

AI总结提出动态填充锚点（DIA），一种无需训练的方法，通过动态估计结束锚点位置调整生成长度，确保格式约束下的结构正确性和语义连贯性，在GSM8K和MATH上实现零样本性能提升。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情

AI中文摘要

扩散大语言模型（dLLMs）提供双向注意力和并行生成，使其能够利用全局上下文并自然支持格式约束任务，如可解析的JSON或推理模板。虽然直接的固定锚点可以强制执行此类约束，但它们通常强加刚性跨度，导致推理截断或内容冗余。为了克服这一点，我们提出了动态填充锚点（DIA），一种无需训练的方法，在迭代填充之前动态估计结束锚点位置以调整生成长度。这种灵活机制确保了结构正确性和语义连贯性，避免了固定跨度方法的低效。在推理基准上的实验表明，DIA显著提高了格式合规性和答案准确性，在GSM8K和MATH上实现了显著的零样本增益。这些结果确立了DIA作为通往可靠、结构感知生成的一条稳健路径。

英文摘要

Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.


2606.04511 2026-06-04 cs.CL cs.LG (version update)

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

SparDA: 用于高效长上下文LLM推理的稀疏解耦注意力

Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

Affiliations: NVIDIA; Thinking Machines Lab; ByteDance Seed; MIT

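AI summary: Proposes the SparDA architecture, which introduces a fourth projection, Forecast, to decouple KV-cache prefetching from attention and reduce sparse-selection overhead, achieving up to 1.25× prefill and 1.7× decode speedups in long-context inference.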

AI Chinese abstract

稀疏注意力减少了长上下文LLM推理的计算和内存带宽。然而，仍然存在两个关键挑战：（1）KV缓存容量随序列长度增长，卸载到CPU内存引入了PCIe传输瓶颈；（2）稀疏选择步骤本身保持$O(T^2)$复杂度，在长上下文中可能主导注意力成本。我们提出SparDA，一种解耦的稀疏注意力架构，它在Query、Key和Value之外引入了第四个逐层投影——Forecast。Forecast预测下一层所需的KV块，从而实现超前选择，将CPU到GPU的预取与当前层执行重叠。由于Forecast与注意力查询解耦，我们的GQA实现为每个GQA组使用一个Forecast头，相比原始多头选择器减少了选择开销。SparDA增加了<0.5%的参数，并通过匹配原始选择器的注意力分布仅训练Forecast投影。在两个稀疏预训练的8B模型上，SparDA匹配或略微提高了准确性，并且相比稀疏注意力卸载基线，提供了高达1.25倍的预填充加速和1.7倍的解码加速。通过使单个GPU上可行的批量大小更大，SparDA进一步实现了比非卸载稀疏基线高达5.3倍的解码吞吐量。我们的源代码可在https://github.com/NVlabs/SparDA获取。

English abstract

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25$\times$ prefill speedup and 1.7$\times$ decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3$\times$ higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.


2606.04507 2026-06-04 cs.CL cs.AI (version update)

Self-Evolving Deep Research via Joint Generation and Evaluation

通过联合生成与评估实现自我进化的深度研究

Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

Affiliations: The Hong Kong University of Science and Technology; ByteDance, China; University College London

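AI summary: Proposes the SCORE framework, which jointly optimizes an evaluator and a solver through shared-parameter co-evolutionary training, addressing unverifiable rewards in deep research report generation and steadily improving generation quality.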

AI Chinese abstract

大型语言模型（LLM）在日常应用中越来越广泛，其中深度研究是一项特别重要的能力。与传统的问答（QA）任务不同，深度研究报告生成缺乏明确的真实答案，这使得奖励设计本质上不可验证，限制了有效的强化学习。现有方法通过LLM作为评判者和查询相关的评估标准来缓解这一挑战，但它们仍然依赖静态评估器，无法随着求解器的改进而调整标准，导致优化压力不足并最终饱和。我们通过一个用于深度研究评估和生成的自我进化协同进化训练框架（SCORE）来解决这一限制，该框架在共享参数的学习过程中紧密耦合评估器和求解器。我们不将生成和评估视为孤立的模块，而是利用它们的内在联系，在单个共享参数模型中实现联合改进。为了限制这一过程，我们引入了一个元控制机制，该机制根据求解器的性能动态控制评估环境，鼓励有效的评估维度和足够深入的评估器搜索。在深度研究基准上的大量实验表明，报告生成质量持续提升，表明协同进化评估和生成是训练开放式研究代理的一个有前景的方向。

English abstract

Large Language Models (LLMs) are increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a self-evolving co-evolutionary training framework for deep research evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To constrain this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.


2606.04500 2026-06-04 cs.CL (version update)

SANE: Schema-aware Natural-language Evaluation of Biological Data

SANE：生物数据的模式感知自然语言评估

Rolf Gattung, Martin Krueger, Markus Reischl

Affiliations: Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT)

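AI summary: Proposes the SANE paradigm, which uses schema-aware, automatically generated benchmarks to evaluate the reliability of few-shot large language models on domain-specific text-to-SQL tasks, and finds that structured prompts and constraints enable accurate query generation.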

Comments 5 pages, 3 figures, submitted but not yet reviewed by BMT2026


Token排名是不可伪造的语言模型签名

Matthew Finlayson, Andreas Grivas, Xiang Ren, Swabha Swayamdipta

Affiliations: University of Southern California; University of Edinburgh

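AI summary: Finds that a language model's token rankings (tokens ordered by probability) form a unique, unforgeable signature, and studies how a restricted API can balance presenting that signature against leaking model parameters.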

AI Chinese abstract

已知语言模型参数对其logit输出施加了（每个模型）独特的几何约束，这作为识别模型的签名，但当API分发logits时也会泄露模型的最后一层参数。我们研究了更严格的API，这些API只暴露token排名（即按概率排序，但不暴露概率值），并发现排名也构成签名：对于足够大的$k$，每个模型都有一组唯一的可行top-$k$排名。此外，排名签名是第一个已知的（多项式时间）不可伪造签名，因为找到一个具有相同可行排名集的模型是NP难的。在安全方面，我们发现token排名已经足以近似窃取模型的最后一层，类似于logits，尽管近似太粗糙以至于无法伪造签名，并且可以通过将API限制为足够小的$k$的top-$k$ token来有效应对。由于展示模型签名所需的top-$k$通常小于防止窃取所需的$k$，因此API可以在不泄露模型参数的情况下展示不可伪造的签名。

English abstract

Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signature that identifies the model, but also leaks the model's final layer parameters when an API distributes logits. We investigate more restrictive APIs that expose token rankings (i.e., their ordering by probability, but not the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-$k$ rankings for sufficiently large $k$. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings are already sufficient to approximately steal the final layer of the model, similar to logits, though the approximation is too coarse to forge the signature, and can be effectively countered by restricting the API to top-$k$ tokens with sufficiently small $k$. Since the top-$k$ required to present the model signature is generally smaller than the $k$ required to prevent stealing, it is possible for an API to present an unforgeable signature without leaking model parameters.
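As a rough illustration of the restricted API discussed in this abstract, the sketch below returns only a top-$k$ token ranking and checks that the ranking is the same whether it is computed from logits, from softmax probabilities, or from temperature-scaled probabilities. The logits and the helper names `softmax` and `top_k_ranking` are assumptions made for the example, not the paper's code.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ranking(scores, k):
    """Return the indices of the k highest-scoring tokens, best first.

    Ties are broken by token index so the result is deterministic.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


# Hypothetical next-token logits over a 6-token vocabulary.
logits = [2.1, -0.3, 0.7, 3.4, 0.0, 1.5]

ranking_from_logits = top_k_ranking(logits, k=3)
ranking_from_probs = top_k_ranking(softmax(logits), k=3)
ranking_hot = top_k_ranking(softmax(logits, temperature=2.0), k=3)

print(ranking_from_logits)  # [3, 0, 5]
print(ranking_from_probs == ranking_from_logits == ranking_hot)  # True
```

Because softmax and positive temperature scaling are monotone, a ranking-only API hides the probability values themselves; the abstract's point is that the set of feasible rankings can still identify the model.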


2606.04455 2026-06-04 cs.AI cs.CL (version update)

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

元智能体挑战：当前智能体能否自主开发智能体？

Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Ant Group

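AI summary: Proposes the Meta-Agent Challenge (MAC) framework for evaluating whether frontier models can autonomously develop agent systems, finding that most meta-agents fail to match human-designed baseline policies and exhibit robustness and alignment problems.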

Comments Website: https://meta-agent-challenge.com/

AI Chinese abstract

当前的AI基准测试评估智能体在人类设计的工作流程中执行任务的能力。这些评估从根本上未能衡量一个关键的更高级能力：模型能否自主开发智能体系统。我们引入了元智能体挑战（MAC），这是一个评估框架，旨在测试前沿模型自主开发智能体的能力。具体来说，一个代码智能体（元智能体）被赋予一个沙盒环境、一个评估API和一个时间限制，以迭代地编程一个智能体工件，该工件在五个领域的保留测试集上最大化性能。为确保评估完整性，该框架通过多层防御机制防止奖励黑客攻击。利用该框架，我们证明元智能体很少能匹配人类设计的基线策略，而少数能匹配的则主要由专有前沿模型主导。此外，设计过程表现出高方差，高优化压力会浮现出诸如真实数据窃取等新兴对抗行为——凸显了鲁棒性和模型对齐方面的关键缺陷。最终，MAC为自主AI研究和开发提供了一个严格的、开源的基准测试，为评估递归自我改进提供了经验代理。基准测试公开于：https://github.com/ant-research/meta-agent-challenge。

English abstract

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.


2606.04454 2026-06-04 cs.CL (version update)

Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

通过外部子图生成增强大语言模型的逐步推理

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

Affiliations: School of Information Science and Engineering, Chongqing Jiaotong University; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications

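AI summary: Proposes the SGR framework, which generates query-relevant subgraphs from a knowledge graph to guide large language models through stepwise reasoning, improving the accuracy, robustness, and interpretability of complex multi-step reasoning.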

AI Chinese abstract

大语言模型在自然语言生成和下游推理任务中表现出色，但在复杂多步推理中仍面临逻辑一致性、事实基础和可解释性方面的挑战。为解决这些局限，本文提出SGR，一种通过查询相关子图生成将大语言模型与外部知识图谱集成的逐步推理增强框架。给定输入问题，SGR首先提取关键实体、关系和约束以构建结构化模式，然后通过模式引导查询从知识图谱中检索紧凑子图。生成的子图提供明确的关系证据，引导语言模型进行逐步推理。此外，SGR结合了基于Cypher的直接推理与协作推理集成，允许根据模型置信度和图一致性验证和聚合来自多个推理路径的候选答案。在包括CWQ、WebQSP、GrailQA和KQA Pro的基准数据集上的实验表明，SGR在推理准确性和Hits@1性能上优于标准提示和几种知识增强基线。消融研究进一步表明，模式引导和基于Neo4j的检索对框架的有效性都至关重要。这些结果表明，动态生成的外部子图可以提高基于大语言模型的推理的准确性、鲁棒性和可解释性。

English abstract

Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.


2606.04450 2026-06-04 cs.CL cs.CY (version update)

Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

倾听劳动力：使用LLMs从社交媒体话语中测量建筑工人安全态度

Farouq Sammour, Yuxin Zhang, Zhenyu Zhang

Affiliations: Texas A&M University

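AI summary: Proposes and validates the Construction Safety Attitude Framework (CSAF), which uses an LLM classifier to measure worker safety attitudes in Reddit community discourse, enabling high-accuracy multidimensional analysis.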

AI Chinese abstract

工人安全态度是决定建筑工地上保护措施是否被应用或规避的关键因素。然而，大规模测量安全态度一直难以实现。安全态度是多维的，因话题而异，并且在工人自己的对话中最为坦诚。本研究创建并验证了建筑安全态度框架（CSAF），该框架整合了两个组成部分：一个基于理论的结构，沿八个维度表征安全态度；以及一个用于在工人自然话语中测量这些态度的操作化编码手册。将CSAF应用于Reddit上r/Construction社区的250条帖子和评论，经过训练的编码者达到了高度一致（Krippendorff's α = 0.85）。成对提升度和条件概率证实了八个维度既相关又不同。为了将框架应用于大量话语，CSAF通过大语言模型（LLM）分类器进行操作化。在450条r/Construction贡献中，分类器再现了专家人工编码（Cohen's κ = 0.90，精确率 = 0.98，召回率 = 0.98），并且在400条r/Roofing贡献中，转移到不同行业社区后仍保持该准确率（κ = 0.89，精确率 = 0.98，召回率 = 0.97）。一项价值验证案例研究将经过验证的分类器应用于10,346条r/Roofing贡献，证明CSAF能够按安全主题区分多维态度，追踪它们随时间的变化，并追溯不利态度背后的推理。因此，本研究提供了一个理论扎实、经验验证的工具来检查安全态度，为针对不安全实践背后态度的干预措施提供了基础。

English abstract

Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers' own conversations. This study created and validated the Construction Safety Attitude Framework (CSAF), which integrates two components: a theory-grounded structure that characterizes safety attitudes along eight dimensions, and an operational codebook for measuring them in workers' naturalistic discourse. Applying CSAF to 250 posts and comments from the r/Construction community on Reddit, trained coders reached strong agreement (Krippendorff's α = 0.85). Pairwise lift and conditional probability confirmed that the eight dimensions are related yet distinct. To apply the framework across large volumes of discourse, CSAF was operationalized through a large language model (LLM) classifier. On 450 r/Construction contributions, the classifier reproduced expert human coding (Cohen's κ = 0.90, precision = 0.98, recall = 0.98), and on 400 contributions from r/Roofing it retained that accuracy after transfer to a different trade community (κ = 0.89, precision = 0.98, recall = 0.97). A proof-of-value case study then applied the validated classifier to 10,346 contributions from r/Roofing, demonstrating that CSAF can distinguish multidimensional attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones. The study therefore provides a theoretically grounded, empirically vetted instrument for examining safety attitudes, offering a basis for targeted interventions that address the attitudes underlying unsafe practices.
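The agreement statistics quoted in this abstract (Cohen's kappa, precision, recall) can be computed from paired labels as in the following sketch. The two label lists are made-up toy data, not the study's annotations, and the function names are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def precision_recall(gold, predicted, positive=1):
    """Precision and recall for one class; assumes that class occurs in both lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)


# Toy example: 1 = "attitude dimension present", 0 = "absent".
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
model = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

kappa = cohens_kappa(human, model)
precision, recall = precision_recall(human, model)
print(round(kappa, 3), round(precision, 3), round(recall, 3))  # 0.8 1.0 0.8
```

Here the raters agree on 9 of 10 items, but half of that agreement is expected by chance, which is why kappa (0.8) is lower than raw agreement (0.9).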


2606.04442 2026-06-04 cs.CL cs.AI (version update)

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

MemoryDocDataSet: 联合对话记忆与长文档推理的基准测试

Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou

Affiliations: Northeastern University; Johns Hopkins University; Columbia University; Independent Researcher

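AI summary: Proposes MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs that evaluates a system's ability to handle multi-session conversation history and long-document reading comprehension together; 75.1% of the questions require hybrid reasoning (navigating the conversation history first, then extracting the answer from a document), and experiments reveal a clear joint-retrieval gap.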

Comments 17 pages, 2 figures, 8 tables. Submitted for peer review

AI Chinese abstract

人工智能系统越来越需要结合两种要求很高的能力：导航多轮对话历史和在长文档中进行深度阅读理解。然而，现有的基准测试没有同时评估这两者。我们引入了MemoryDocDataSet，一个包含50个微世界和1000个QA对的合成基准，其中每个实例包含3-5个人物角色、一个跨越数月活动的时间事件图、3-5篇真实长文档（每篇20,000-50,000个token，来自Caselaw Access Project）、基于这些文档的多轮对话，以及跨越五个推理类别的20个问答对。其定义特征是混合源标签：需要系统首先导航对话历史以确定哪个文档相关，然后从该文档中提取答案的问题。混合问题占数据集的75.1%。通过使用LLM作为评判者的提示敏感性自一致性分析来表征数据集质量，在所有50个微世界中得到中位数Cohen's $κ= 0.634$。我们评估了六种基线配置，涵盖截断上下文、长上下文LLM、检索增强生成（RAG）和记忆系统。最佳基线（RAG-Both）在整体F1上达到0.358，在混合问题上达到0.342。仅文档检索（RAG-Doc）在混合问题上降至0.267，尽管在仅文档问题上达到0.453，这显示了明显的联合检索差距，激励了统一对话记忆与长文档导航的架构。我们发布了数据集、生成流水线和所有基线实现。

English abstract

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each, sourced from the Caselaw Access Project), multi-session conversations grounded in those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.


2606.04435 2026-06-04 cs.AI cs.CL cs.CR cs.IR (version update)

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

智能体RAG中的级联幻觉：用于检测和缓解的CHARM框架

Saroj Mishra

Affiliations: University of North Dakota

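AI summary: Targets cascading hallucination in multi-step agentic RAG pipelines, where early errors propagate and amplify into incorrect final outputs, and proposes the CHARM framework, whose four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering) detect and mitigate it, reaching an 89.4% cascade detection rate and an 82.1% reduction in error propagation across several datasets.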

AI Chinese abstract

多步智能体检索增强生成（RAG）管道在复杂推理任务中展现出显著能力，但仍然容易受到一类现有幻觉检测机制系统性遗漏的故障影响：级联幻觉，即在管道早期阶段引入的错误会通过连续推理步骤传播并放大，产生自信但事实不正确的最终输出。为解决这一漏洞，我们将级联幻觉形式化为智能体RAG系统中的一种独特故障模式，提出四种级联模式的分类法，并引入CHARM（级联幻觉感知解析与缓解），一种用于检测和中断多步推理管道中错误传播的架构框架。CHARM包含四个组件——阶段级事实验证、跨阶段一致性跟踪、置信度传播监控和级联解析触发——它们与标准智能体RAG管道并行运行，无需替换架构。我们在HotpotQA、MuSiQue、2WikiMultiHopQA以及一个自定义对抗数据集上，在LangChain智能体管道配置下评估CHARM，实现了89.4%的级联检测率、5.3%的假阳性率、每阶段平均215 ms ± 18 ms的延迟开销，以及82.1%的错误传播减少，而输出级检测器仅为18.5%。组件消融实验证实每个检测模块对整体级联覆盖都有显著贡献。CHARM与人在回路监督框架集成，为生产级智能体AI部署提供完整的可靠性和治理栈。

English abstract

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering) that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms ± 18 ms average latency overhead per stage. CHARM reduces error propagation by 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.


2606.04433 2026-06-04 cs.CV cs.CL cs.LG (version update)

Stateful Visual Encoders for Vision-Language Models

用于视觉-语言模型的有状态视觉编码器

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

Affiliations: University of California, Berkeley

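AI summary: Proposes a stateful visual encoder that conditions each visual representation on prior visual features, strengthening a vision-language model's perception of visual changes in multi-image, multi-turn interaction and yielding consistent improvements on cross-image spatial aggregation, multi-object visual differencing, and trajectory behavior cloning.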

Comments Project page: https://statefulvisualencoders.github.io/

AI Chinese abstract

视觉-语言模型（VLM）越来越多地用于多图像、多轮代理场景，其中决策依赖于视觉变化。然而，在现有的开源权重VLM中，视觉比较仅在语言模型内部进行，而视觉编码器本身是无状态的：每个图像独立编码，无法访问先前的视觉上下文。因此，微小但任务关键的变化可能在语言模型有机会比较之前被减弱，尤其是当这些变化不影响场景的高层语义时。我们引入了一种有状态视觉编码器，它将每个视觉表示条件于先前的视觉特征。在监督微调下，配备有状态编码器的VLM在涉及跨图像空间聚合、多目标视觉差异和视觉轨迹行为克隆的控制任务上取得了一致的改进。这些改进在输入分辨率、语言模型大小和VLM骨干网络上保持一致。最后，我们在实际任务上验证了我们的模型，包括纵向放射学、细粒度图像比较和遥感，其中有状态编码器一致地改进了通用VLM基线，并在选定领域可以匹配或超越专用模型。项目页面：https://statefulvisualencoders.github.io/

English abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/


2606.04418 2026-06-04 cs.SD cs.CL eess.AS (version update)

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

CleanCodec：通过感知引导编码实现高效且鲁棒的语音分词化

Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin

Affiliations: Pennsylvania State University; Drexel University

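AI summary: Proposes CleanCodec, a denoising audio codec that uses a selective information bottleneck to encode only perceptually important features, achieving state-of-the-art tokenization efficiency at 12.5 tokens/s, substantially outperforming existing codecs in speaker similarity and speech intelligibility, and enabling up to 17× faster inference on downstream tasks.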

AI Chinese abstract

神经音频编解码器是语音处理流程的关键组件，将音频压缩为离散令牌以供下游建模。然而，现有编解码器难以平衡重建质量与令牌效率，常常以牺牲语言和声学有意义内容为代价，编码背景噪声和录音伪影等感知无关信息。我们将音频分词化重新定义为选择性信息瓶颈问题，并提出CleanCodec，一种去噪音频编解码器，学习仅编码感知重要特征并丢弃不可感知信息。在每秒仅12.5个令牌的情况下，CleanCodec实现了最先进的分词效率，在说话人相似度和语音可懂度上大幅优于现有编解码器。在下游文本到语音和语音转换任务上的评估进一步展示了改进的性能和高达17倍的推理加速，凸显了显著的效率提升。

English abstract

Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.


2606.04396 2026-06-04 cs.CL (version update)

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

读取轨迹，引导路径：面向扩散语言模型的轨迹感知强化学习

Anant Khandelwal, Manish Gupta

Affiliations: Microsoft AI, India

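AI summary: Proposes the CAPR algorithm, which uses cached trajectory states and a block-level value head to extract tree-search-like fine-grained supervision from the denoising trace, improving reinforcement learning for diffusion language models while lowering compute cost.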

Comments 19 pages, 10 figures, 7 Tables

AI Chinese abstract

扩散大语言模型（dLLMs）通过并行迭代去掩码和修正多个位置来生成响应。这一过程留下了丰富的去噪轨迹，描绘了哪些标记变得可信、哪些仍不稳定以及何时形成承诺。现有的dLLM强化学习方法仅弱化地使用这一信号。扁平化展开成本低，但将单一结果奖励分配给整个轨迹。树展开通过分支部分轨迹并将叶节点奖励向上传播，提供更精细、可验证的训练信号，但计算密集。我们提出疑问：去噪轨迹本身能否在不使用树级计算的情况下提供类似树的监督？我们引入CAPR（缓存-摊销路径细化），一种dLLM-RL算法，它将去噪轨迹总结为紧凑的路径状态，利用缓存轨迹状态生成廉价的兄弟延续，并训练块级价值头用于局部块级监督。在块级去掩码调度下，CAPR记录路径状态和块进度特征，然后根据每个块中揭示的标记将最终结果奖励重新分配到各个块。这训练价值头将一个稀疏奖励转换为块级PPO权重。因此，CAPR恢复了树搜索的大部分粒度，同时避免了完整的树扩展，将展开生成成本降低到扁平展开的大约0.75倍和树展开的0.6倍（在标准设置下）。在4x4数独、Countdown、GSM8K和Math500上，使用密集和混合专家LLaDA骨干网络，CAPR在256和512标记预算下为RL调优的dLLMs设立了新的最先进水平。在数独上，它以不到三分之一的每步计算量匹配了最强的树结构基线。

English abstract

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
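One simple way to picture the reward redistribution step is a proportional split, sketched below. The abstract only says the final outcome reward is redistributed across blocks "according to the tokens revealed in each block"; the proportional rule, the function name, and the token counts here are assumptions for illustration and may differ from CAPR's actual scheme.

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one sparse outcome reward across blocks.

    Assumed rule: each block's share is proportional to the number of
    tokens it revealed. Falls back to an even split if no tokens were revealed.
    """
    n_blocks = len(tokens_revealed_per_block)
    if n_blocks == 0:
        return []
    total = sum(tokens_revealed_per_block)
    if total == 0:
        return [final_reward / n_blocks] * n_blocks
    return [final_reward * t / total for t in tokens_revealed_per_block]


# Hypothetical rollout: four blocks revealing 8, 16, 4, and 4 tokens, final reward 1.0.
block_targets = redistribute_reward(1.0, [8, 16, 4, 4])
print(block_targets)  # [0.25, 0.5, 0.125, 0.125]
```

Per-block targets of this kind are what a block-level value head could be regressed onto, turning one trajectory-level reward into a denser training signal.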


2606.04392 2026-06-04 cs.LG cs.CL (version update)

Physics-Informed Neural Network Modeling of Biodegradable Contaminant Transport through GCL/SL Composite Liners

物理信息神经网络建模可生物降解污染物通过GCL/SL复合衬垫的迁移

Dong Li, Yapeng Cao, Haiping Zhao, Shutong Han

Affiliations: Department of Civil, Environmental, and Infrastructure Engineering, George Mason University; State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences; Laboratoire Navier/CERMES, École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris

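AI summary: Proposes a two-domain physics-informed neural network framework in which a hard-constrained PINN accurately simulates contaminant transport through GCL/SL composite liners, and extends it to the inverse problem of identifying the degradation half-life.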

AI Chinese abstract

本研究开发了一个双域物理信息神经网络框架，用于污染物通过GCL/SL复合衬垫系统的迁移，其中薄GCL层采用稳态平流-弥散-生物降解公式处理，而下层土壤衬垫建模为瞬态传输域。在不同渗滤液水头条件下，评估了两种公式与解析解和有限元参考解的对比：标准软约束PINN（Std-PINN）和硬约束PINN（H-PINN），其中选定的边界和初始条件直接嵌入试验解中。Std-PINN捕捉了整体突破行为，但在早期传输阶段显示出较大误差，特别是在平流传输更显著的高水头条件下。H-PINN减少了与基于惩罚的约束执行相关的优化负担，提供了更准确和稳定的浓度预测，将MAE从Std-PINN的约0.058-0.067降低到H-PINN的约0.011-0.023，同时将MRE从约9.10%-19.16%降低到约2.08%-3.14%。参数分析证实，采用tanh激活函数和优化网络结构的H-PINN提供了最佳的预测精度。H-PINN进一步扩展到逆建模，用于从有限的浓度观测中识别SL降解半衰期，显示出对预设值的可靠收敛性以及在低到中等观测噪声下的可接受鲁棒性。

English abstract

This study develops a two-domain physics-informed neural network framework for contaminant transport through a GCL/SL composite liner system, in which the thin GCL layer is treated using a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner is modeled as a transient transport domain. Two formulations are evaluated against analytical and finite-element reference solutions under different leachate-head conditions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN), in which selected boundary and initial conditions are embedded directly into the trial solutions. The Std-PINN captures the overall breakthrough behavior but shows larger errors during the early transport stage, particularly under higher leachate heads where advective transport becomes more pronounced. The H-PINN reduces the optimization burden associated with penalty-based constraint enforcement and provides more accurate and stable concentration predictions, lowering the MAE from approximately 0.058-0.067 for the Std-PINN to about 0.011-0.023 for the H-PINN, while reducing the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. Parametric analyses confirm that the H-PINN with the tanh activation function and an optimized network structure provides the best predictive accuracy. The H-PINN is further extended to inverse modeling for identifying the SL degradation half-life from limited concentration observations, showing reliable convergence toward prescribed values and acceptable robustness under low-to-moderate observation noise.
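The difference between soft and hard constraint enforcement can be sketched for a single initial condition. In the hypothetical trial solution below, $c(z,t) = c_0(z) + t\,N(z,t)$, the initial condition holds exactly for any network output $N$, whereas the soft version only penalizes the mismatch in the loss. The stand-in `raw_network`, the zero initial concentration, and the specific trial form are assumptions; the paper's own trial solutions are not given in the abstract.

```python
import math


def raw_network(z, t):
    """Stand-in for an unconstrained network output N(z, t)."""
    return math.tanh(0.8 * z - 1.3 * t + 0.2)


def initial_concentration(z):
    """Assumed initial condition c0(z): the liner starts uncontaminated."""
    return 0.0


def hard_constrained(z, t):
    """Trial solution c0(z) + t * N(z, t); equals c0(z) at t = 0 by construction."""
    return initial_concentration(z) + t * raw_network(z, t)


def soft_penalty(depths):
    """Mean squared initial-condition mismatch of the raw network (soft constraint)."""
    errors = [(raw_network(z, 0.0) - initial_concentration(z)) ** 2 for z in depths]
    return sum(errors) / len(errors)


depths = [0.0, 0.25, 0.5, 0.75, 1.0]
hard_residuals = [hard_constrained(z, 0.0) - initial_concentration(z) for z in depths]
penalty = soft_penalty(depths)

print(hard_residuals)  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(penalty > 0.0)   # True: the soft version leaves a residual to be optimized away
```

With the hard form, the optimizer never has to trade the initial-condition term against the PDE residual, which is consistent with the reduced optimization burden the abstract reports for H-PINN.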


2606.04389 2026-06-04 cs.CL (version update)

When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

当来访者不再跟随：基于认知概念化图的策略性咨询框架

Yihao Qin, Junyi Zhao, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Chang Liu, Bin Hu

Affiliations: School of Information Science and Engineering, Lanzhou University

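AI summary: To address inflated scores caused by overly compliant clients in existing evaluation protocols, proposes a resistance-aware framework grounded in cognitive behavioral therapy that simulates dynamic resistance with cognitive conceptualization diagrams and uses reinforcement learning to optimize strategic reasoning and response generation, improving strategy robustness in difficult counseling interactions.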


Deliberate Evolution: 基于智能体推理的样本高效符号回归与LLM

Xinyu Pang, Zhanke Zhou, Xuan Li, Fangrui Lv, Shanshan Wei, Sen Cui, Bo Han, Changshui Zhang

Affiliations: TMLR Group, Department of Computer Science, Hong Kong Baptist University; Beijing National Research Center for Information Science and Technology (BNRist), Department of Automation, Tsinghua University, Beijing, P.R. China; Lenovo Research

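AI summary: Proposes the Deliberate Evolution framework, which decouples symbolic generation from search control and uses adaptive operators, analysis tools, and reflective memory to surpass existing LLM-based symbolic regression methods with only 40% of the sample budget.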

Comments ICML 2026

2606.04340 2026-06-04 cs.CL (version update)

Noisy memory encoding explains negative polarity illusions

噪声记忆编码解释了负极词幻觉

Yuhan Zhang, Edward Gibson

Affiliations: Department of Linguistics, Stanford University; BIO-X Interdisciplinary Biosciences Institute, Stanford University; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology

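AI summary: Drawing on the lossy-context surprisal theory of Hahn et al. (2022), this study proposes that imperfect sentence encoding causes negative polarity illusions, and supports through acceptability-judgment experiments with six novel determiner pairs the hypothesis that determiner similarity strengthens the illusion.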

Comments 21 pages, 5 figures, submitted for journal publication

AI Chinese abstract

像“The authors that no critics recommended have ever received acknowledgment for a best-selling novel”这样的句子有时被认为可接受，尽管严格来说它不合语法，因为负极词“ever”在其位置未获许可。这种行为效应有时被称为“负极词幻觉”。这里我们提出，Hahn等人(2022)的有损上下文惊奇理论——即人们对复杂句子的编码不完美——可能解释这种效应。我们假设人们对主句和从句主语中的限定词记忆表征较差，并可能设想一种限定词交换来许可“ever”。我们提出，这些位置上更相似的限定词会引发更强的幻觉效应。使用六种新型限定词对（例如，“few”和“many”，“few”和“most”）的可接受性判断任务支持了我们的提议，具体表明，即使没有时间压力，新句子“Many authors that few critics recommended have ever received acknowledgment for a best-selling novel”也比规范句引发了更强的幻觉。这些结果进一步支持了人类语言处理是不完美且资源理性的观点：面对工作记忆限制，人类理性地从噪声语言输入中重构最可能的内容，以促进下游处理。

English abstract

A sentence like "The authors that no critics recommended have ever received acknowledgment for a best-selling novel" is sometimes rated as acceptable even though, strictly speaking, it is ungrammatical because the negative polarity word "ever" is not licensed where it is. This behavioral effect is sometimes called a "negative polarity illusion". Here we propose that the lossy context surprisal theory of Hahn et al. (2022) -- whereby people have an imperfect encoding of complex sentences -- might explain this effect. We hypothesize that people have poor memory representation of the determiners in the main-clause and embedded-clause subjects and could entertain a determiner exchange that licenses "ever". We propose that more similar determiners in those positions would trigger stronger illusion effects. Acceptability judgment tasks with six novel determiner pairs (e.g., "few" and "many", "few" and "most") support our proposal, showing, specifically, that a novel sentence, "Many authors that few critics recommended have ever received acknowledgment for a best-selling novel", triggered a much stronger illusion than the canonical one even without time pressure. These results offer further support for the suggestion that human language processing is imperfect and resource-rational: in the face of working memory limitations, humans rationally reconstruct what is most likely from noisy linguistic input to facilitate downstream processing.

2606.04325 2026-06-04 cs.CL 版本更新

Parameter-Efficient Fine-Tuning with Learnable Rank

可学习秩的参数高效微调

Arpit Garg, Simon Lucey, Hemanth Saratchandran

发表机构 * Australian Institute for Machine Learning（澳大利亚机器学习研究所）

AI总结提出LR-LoRA方法，通过训练过程中学习适配器秩而非固定秩，在语言理解和常识推理基准上达到最先进性能。

Comments In Submission

详情

AI中文摘要

低秩适配（LoRA）是一种流行的参数高效微调（PEFT）方法，通过将权重更新限制为低秩适配器，在低维子空间中进行优化，从而引入固定的低秩归纳偏置。在这项工作中，我们质疑固定秩约束是否是参数高效微调最有效的归纳偏置。我们引入了*可学习秩LoRA（LR-LoRA）*，一种在训练过程中学习适配器秩的PEFT方法。LR-LoRA不为所有适配器层规定统一的秩，而是允许优化器为每一层确定合适的秩。使用这种方法，我们发现学习到的秩在层间存在显著差异，Transformer模型中的注意力层和MLP层表现出系统性的不同秩偏好。在一系列语言理解和常识推理基准测试中，LR-LoRA在大多数设置下达到了最先进的性能，并且始终优于强大的PEFT基线，表明可学习秩比固定秩适配提供了更灵活和有效的归纳偏置。

英文摘要

Low-Rank Adaptation (LoRA) is a popular parameter-efficient fine-tuning (PEFT) method that restricts weight updates to low-rank adapters, introducing a fixed low-rank inductive bias by optimizing in a low-dimensional subspace. In this work, we question whether a fixed-rank constraint is the most effective inductive bias for parameter-efficient fine-tuning. We introduce *Learnable Rank LoRA (LR-LoRA)*, a PEFT method in which the adapter rank is learned during the training process. Instead of prescribing a uniform rank for all adapter layers, LR-LoRA allows the optimizer to determine the appropriate rank for each layer. Using this approach, we find substantial layer-wise variation in the learned ranks, with the attention and MLP layers in the transformer models exhibiting systematically different rank preferences. Across a range of language understanding and commonsense reasoning benchmarks, LR-LoRA achieves state-of-the-art performance in most settings and consistently outperforms strong PEFT baselines, demonstrating that a learnable rank provides a more flexible and effective inductive bias than fixed-rank adaptations.
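As a rough illustration of the low-rank constraint this abstract refers to, the sketch below computes the standard fixed-rank LoRA update, ΔW = (α/r)·B·A, and shows how the trainable-parameter count scales with the rank r. It does not implement LR-LoRA's rank-learning mechanism, which the abstract does not specify. The helper names (`matmul`, `lora_delta`, `lora_param_count`) are hypothetical and the values are purely illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, alpha):
    """Standard LoRA weight update: (alpha / r) * B @ A.

    B has shape (d_out, r) and A has shape (r, d_in), so the update
    has rank at most r regardless of d_out and d_in.
    """
    r = len(A)
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one adapter of rank r."""
    return r * (d_in + d_out)


# Illustrative rank-1 adapter for a 4x3 weight matrix.
B = [[1.0], [2.0], [0.0], [-1.0]]
A = [[1.0, 0.0, 2.0]]
delta = lora_delta(B, A, alpha=2.0)
```

Under this sketch, giving each layer its own rank simply changes `r` per adapter: for a 4096 by 4096 weight, `lora_param_count(4096, 4096, 8)` is 65,536 parameters versus about 16.8 million for a full update, so per-layer ranks directly control where the parameter budget is spent.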

2606.04302 2026-06-04 cs.CL cs.LG 版本更新

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

LazyAttention: 高效检索增强生成中的延迟位置编码

Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park

发表机构 * Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校Siebel计算与数据科学学院）； Google（谷歌）； Amazon（亚马逊）

AI总结针对检索增强生成中KV缓存位置编码复用性差的问题，提出LazyAttention机制，通过核化延迟位置编码实现零拷贝、位置无关的KV重用，显著降低首令牌延迟并提升推理吞吐量。

Comments ICML 2026

详情

AI中文摘要

键值（KV）缓存通过重用已生成令牌的过去计算来加速大型语言模型（LLM）的推理。在长上下文应用（如检索增强生成（RAG）和上下文学习（ICL））中，其重要性更加凸显。然而，传统的KV缓存将位置信息直接嵌入缓存中，限制了其可重用性。现有解决方案要么将重用限制为前缀，要么需要昂贵的内存物化来进行位置重新编码。我们引入了LazyAttention，一种新颖的注意力机制，它通过核化延迟位置编码来实现零拷贝、位置无关的KV重用。通过在注意力内核中动态调整位置编码，LazyAttention解决了物化瓶颈，使得单个物理KV副本能够服务于任意位置的多个逻辑请求。利用为预填充和解码定制的注意力内核，我们的系统实现了显著的效率提升：在偏斜的文档分布下，与最先进的Block-Attention相比，首令牌延迟（TTFT）降低了1.37倍，推理吞吐量提高了1.40倍，同时保持了可比的输出质量。

英文摘要

Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37$\times$ and increases inference throughput by 1.40$\times$ compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
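To make the idea of deferred positional encoding more concrete, here is a minimal sketch that assumes rotary position embeddings (RoPE); the abstract does not name the positional scheme, so this is an assumption. Keys are cached without any rotation and rotated only when a score is computed, so one cached key can be placed at any logical position. The function names are hypothetical, and this is not the paper's kernel implementation.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_at(query, raw_key, q_pos, k_pos):
    """Attention logit with the position applied at read time.

    `raw_key` is the cached, position-free key; it is never rewritten,
    so the same cached copy can serve requests at different offsets.
    """
    return dot(rope(query, q_pos), rope(raw_key, k_pos))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
near = score_at(q, k, q_pos=5, k_pos=2)
far = score_at(q, k, q_pos=105, k_pos=102)
```

Because RoPE scores depend only on the relative offset between query and key, `near` and `far` agree up to floating-point error. That property is what lets a position-free cached key be reused at arbitrary positions in this sketch.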

2606.04286 2026-06-04 cs.CL 版本更新

Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

使用基于文本的因果推断解构影响在线评论评分的因素

Linsen Li, Aron Culotta, Nicholas Mattei

发表机构 * Department of Computer Science Tulane University（杜兰大学计算机科学系）

AI总结提出基于CausalBERT的文本因果分析方法，通过温度缩放、超参数优化和可解释性改进，从60万条美国K-12学校评论中解构各因素对整体评分的影响。

Comments HLT/NAACL 2025

详情

DOI: 10.18653/v1/2025.naacl-long.562
Journal ref: In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies

AI中文摘要

在线评论提供了对产品或服务各方面感知质量的宝贵见解。虽然基于方面的情感分析侧重于从评论中提取这些方面，但关于每个方面对整体感知影响的研究较少。由于方面之间的相关性，分离每个方面的影响尤其具有挑战性。本文介绍了一种基于文本因果分析最新进展的方法，特别是CausalBERT，以解构每个因素对整体评论评分的影响。我们通过三个关键改进增强了CausalBERT：用于更校准的处理分配估计的温度缩放；减少混杂过度调整的超参数优化；以及表征发现混杂因素的可解释性方法。在这项工作中，我们将评论中的文本提及视为现实世界属性的代理。我们在来自超过60万条美国K-12学校评论的真实和半合成数据上验证了我们的方法。我们发现，所提出的增强方法产生了更可靠的估计，并且对学校管理和基准测试表现的感知是整体学校评分的重要驱动因素。

英文摘要

Online reviews provide valuable insights into the perceived quality of facets of a product or service. While aspect-based sentiment analysis has focused on extracting these facets from reviews, there is less work understanding the impact of each aspect on overall perception. This is particularly challenging given correlations among aspects, making it difficult to isolate the effects of each. This paper introduces a methodology based on recent advances in text-based causal analysis, specifically CausalBERT, to disentangle the effect of each factor on overall review ratings. We enhance CausalBERT with three key improvements: temperature scaling for better calibrated treatment assignment estimates; hyperparameter optimization to reduce confound overadjustment; and interpretability methods to characterize discovered confounds. In this work, we treat the textual mentions in reviews as proxies for real-world attributes. We validate our approach on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools. We find that the proposed enhancements result in more reliable estimates, and that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.
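The temperature scaling mentioned in this abstract can be sketched generically as follows. This is the standard post-hoc calibration recipe (divide the logit by a scalar T chosen on held-out data), not the authors' code. Treating the treatment-assignment output as a single binary logit, and picking T by grid search over negative log-likelihood, are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def scaled_prob(logit, temperature):
    """Calibrated probability: T > 1 softens, T < 1 sharpens."""
    return sigmoid(logit / temperature)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels under temperature T."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = scaled_prob(z, temperature)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid):
    """Pick the temperature on the grid that minimizes held-out NLL."""
    return min(grid, key=lambda t: nll(logits, labels, t))


# Hypothetical held-out set: confident logits, one of four is wrong.
held_out_logits = [4.0, 4.0, 4.0, -4.0]
held_out_labels = [1, 1, 0, 0]
best_t = fit_temperature(held_out_logits, held_out_labels, [0.5, 1.0, 2.0, 4.0])
```

In this toy setting the model is overconfident, so the search prefers a temperature above 1, which pulls the estimated propensities toward 0.5 without changing their ordering.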

2606.04284 2026-06-04 cs.LG cs.AI cs.CL 版本更新

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

稀疏混合专家奖励模型学习可解释且专业化的专家用于个性化偏好建模

Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg

发表机构 * Saarland University（萨尔兰大学）； Independent Researcher（独立研究者）； Bielefeld University（比勒菲尔德大学）； Max Planck Institute for Software Systems（马克斯·普朗克软件系统研究所）； Max Planck Institute for Informatics（马克斯·普朗克信息研究所）

AI总结提出稀疏混合专家奖励模型，通过稀疏路由和专家多样性训练，从二元偏好数据中学习可解释的专家模式，提升个性化偏好建模的测试时适应性和可解释性。

详情

AI中文摘要

偏好建模在基于人类反馈的强化学习（RLHF）中扮演核心角色，使大型语言模型（LLMs）与人类价值观对齐。然而，大多数现有方法假设一个通用的奖励函数，忽视了人类偏好的多样性和异质性。为了在不增加额外标注成本的情况下解决这一限制，最近的工作提出从二元数据中学习多个偏好组件，并组合它们以建模个体偏好。然而，这些组件往往无法捕捉连贯且解耦的模式，限制了其可解释性和个性化效果。在这项工作中，我们提出了一种稀疏混合专家（MoE）奖励模型，该模型在二元偏好数据训练过程中鼓励稀疏路由和专家多样性。在受控和真实世界的实验中，稀疏MoE学习了可解释的路由模式和专业化的专家。它还改进了测试时的个性化，并且适应后的专家权重变化为分析模型如何适应个性化偏好提供了定性视角。

英文摘要

Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.

2606.04274 2026-06-04 cs.CL cs.CY 版本更新

StepPRM-RTL：基于逐步过程奖励引导的LLM微调以增强RTL综合

Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee

发表机构 * IBM Research San Jose CA USA（IBM研究院圣何塞加州美国）

AI总结提出StepPRM-RTL框架，结合逐步轨迹建模、过程奖励模型和检索增强微调，通过密集反馈和蒙特卡洛树搜索探索推理路径，提升LLM生成RTL代码的功能正确性和推理保真度，在基准数据集上相比先前方法提升超10%。

Comments 6 pages, 2 figures, DAC'2026

详情

DOI: 10.1145/3770743.3804218

AI中文摘要

由于Verilog和VHDL中的长程推理、多步依赖和严格正确性约束，数字硬件设计的RTL代码自动生成仍然具有挑战性。我们提出StepPRM-RTL，一种新颖的框架，结合逐步轨迹建模、过程奖励模型（PRM）和检索增强微调（RAFT），以增强基于LLM的RTL代码生成的功能正确性和推理保真度。StepPRM-RTL从规范解构建逐步推理轨迹，其中每一步包含一个理由和增量代码修改。过程奖励模型（PRM）评估中间步骤，提供密集反馈，指导RAFT微调期间的强化式更新。蒙特卡洛树搜索（MCTS）探索替代推理路径，用高质量轨迹丰富训练数据集。这种逐步和结果感知奖励的集成使模型能够学习如何以及为何构建正确的RTL，从而改善超出标准监督或基于结果训练的长程推理。在基准Verilog和VHDL数据集上的实验评估表明，StepPRM-RTL在功能正确性和推理保真度指标上优于先前最佳方法超过10%。消融研究证实，PRM引导奖励和逐步轨迹探索的结合是其性能的关键。StepPRM-RTL跨RTL语言泛化，并为高保真、可解释的代码生成提供了可扩展框架，为LLM辅助硬件设计自动化建立了新标准。

英文摘要

Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

2606.04244 2026-06-04 cs.AI cs.CL cs.CV cs.LG 版本更新

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

VAMPS: 视觉辅助数学问题求解基准

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出VAMPS基准，通过1,168道双语多选题评估多模态大模型在借助绘图工具进行数学推理时的表现，发现直接解析求解优于工具辅助视觉求解。

详情

AI中文摘要

多模态大语言模型在复杂推理方面能力日益增强，但当它们必须通过工具外部化问题然后基于工具输出进行推理时，尤其是在依赖视觉辅助的情况下，其性能往往会下降。这一差距尤为重要，因为真实的工程和科学工作流程通常依赖可视化工具进行分析、验证和决策。为了研究这一差异，我们引入了VAMPS（视觉辅助数学问题求解），一个用于图辅助数学的基准。VAMPS包含1,168个多模态、双语选择题问答对，这些题目来自伊朗大学入学考试的代数和微积分问题，并通过人工审核的LLM生成的合成变体进行了扩展，所有题目都经过精心挑选，使得绘图能够通过揭示交点、极值、渐近线等提供自然的求解策略。VAMPS旨在用于基准测试和诊断，它超越了以往主要评估在固定视觉输入上进行推理的多模态基准，通过测试模型是否能够从构建有用的图形中受益并将其答案基于结果可视化。总体而言，我们发现，在一组多样化的模型中，直接解析求解出人意料地优于工具辅助的视觉求解，即使在绘图是自然策略的问题上也是如此。

英文摘要

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2606.04240 2026-06-04 cs.CV cs.AI cs.CL 版本更新

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

EReL@MIR 2025 多模态文档检索挑战赛（赛道1）概述

Jingbiao Mei

发表机构 * University of Cambridge（剑桥大学）； Cambridge United Kingdom（剑桥英国）

AI总结本文介绍了EReL@MIR 2025多模态文档检索挑战赛（赛道1）的设计、数据集、评估协议、最终排名及前三名获胜系统的分析，所有系统均基于Qwen2-VL系列解码器多模态大语言模型嵌入器。

Comments MDR Challenge Report at WWW2025

详情

AI中文摘要

对于视觉丰富的文档（即文本与图形、表格和图表交织的页面）的检索，对于多模态检索增强生成至关重要，然而大多数检索器仍然丢弃视觉通道。多模态文档检索挑战赛是首届EReL@MIR研讨会（与2025年万维网会议同期举办）中MIR挑战赛的赛道1，要求参与者构建一个单一检索系统，处理两种互补的场景：基于文本查询在长文档内进行封闭集文档页面检索（MMDocIR），以及基于图像或图像加文本查询进行开放域维基百科风格段落检索（M2KR）。系统根据两个任务上平均Recall@$\{1,3,5\}$的宏平均值进行排名。该挑战赛吸引了来自22个团队的455名参赛者和586份提交。本报告描述了挑战赛的设计、数据集和评估协议；报告了最终排名；并分析了三个获胜团队的系统。所有三个系统都基于Qwen2-VL系列的解码器多模态大语言模型嵌入器，而非CLIP风格的编码器，主要区别在于它们是通过微调集成、无训练的多路融合与强视觉语言重排序器，还是零样本后期交互达到顶尖水平。无训练系统与微调获胜者的得分差距在0.1分以内。

英文摘要

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@$\{1,3,5\}$ over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within $0.1$ point of the fine-tuned winner.
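One plausible reading of the ranking metric described in this abstract is sketched below: per task, average Recall@k over queries and over k in {1, 3, 5}, then take the unweighted mean of the two task scores. The exact recall definition and the official scoring script are not given in the abstract, so the function names and the toy data are assumptions.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall(queries, ks=(1, 3, 5)):
    """Average Recall@k over queries, then over the cutoffs in `ks`.

    `queries` is a list of (ranked_ids, relevant_ids) pairs.
    """
    per_k = [
        sum(recall_at_k(ranked, gold, k) for ranked, gold in queries) / len(queries)
        for k in ks
    ]
    return sum(per_k) / len(per_k)


def challenge_score(task_a_queries, task_b_queries):
    """Macro-average: each task counts equally, whatever its size."""
    return (mean_recall(task_a_queries) + mean_recall(task_b_queries)) / 2


# Toy data: one query per task, each with a single gold item.
page_task = [(["p1", "p2", "p3", "p4", "p5"], ["p1"])]     # gold at rank 1
passage_task = [(["a", "b", "c", "d", "e"], ["c"])]        # gold at rank 3
score = challenge_score(page_task, passage_task)
```

With the macro-average, a system cannot compensate for weak results on one task simply because the other task has more queries, which matches the stated goal of a single system that handles both regimes.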

2606.04236 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Supportive Token Revealing for Fast Diffusion Language Model Decoding

支持性标记揭示：快速扩散语言模型解码

Giries Abu Ayoub, Mario Barbara, Lluís Pastor-Pérez, Tanja Bien, Aneesh Barthakur, Alaa Maalouf, Loay Mualem

发表机构 * Department of Computer Science, University of Haifa（海法大学计算机科学系）； Institute for AI, University of Stuttgart（斯图加特大学人工智能研究所）； IMPRS-IS

AI总结提出AXON模块，通过选择注意力、不确定性和置信度信号中的锚点标记来改善扩散语言模型并行解码的质量-延迟权衡。

详情

AI中文摘要

离散扩散语言模型可以通过并行更新多个掩码位置来高效生成文本，但这种并行性引入了质量-延迟权衡。激进的解码可能过早提交相互依赖的标记，而保守的解码则需要大量去噪步骤。现有方法通过使用置信度或依赖性标准决定哪些标记可以安全揭示来解决这一矛盾。然而，避免不安全提交并不一定使剩余的掩码序列易于解码，因为不确定的标记可能依赖于掩码标记，从而成为去噪步骤的瓶颈。我们提出AXON，一个无需训练的模块，可添加到现有扩散语言模型的并行解码策略之上。AXON不替换基础解码器，而是监控剩余不确定的掩码标记，并仅当它们当前状态表明需要额外上下文时才进行干预。然后它将标准从揭示哪些标记最安全转变为哪些自信揭示最能支持后续去噪。AXON使用注意力、不确定性和置信度信号选择锚点，即不确定位置关注的自信掩码标记。在多个扩散语言模型的推理和代码生成基准上的实验表明，AXON改善了现有并行解码器的质量-延迟权衡，通常减少函数评估次数，同时保持或提高准确性。

英文摘要

Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors, confident masked tokens that uncertain positions attend to, using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.

2606.04231 2026-06-04 cs.CL cs.AI 版本更新

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

MM-BizRAG：面向通用企业问答的多模态检索增强生成再思考

Hanoz Bhathena, Parin Rajesh Jhaveri, Rohan Mittal, Prateek Singh, Aymen Kallala, Rachneet Kaur, Yiqiao Jin, Zhen Zeng, Adwait Ratnaparkhi, Denis Kochedykov

发表机构 * JPMorgan Chase & Co.（摩根大通公司）； Georgia Institute of Technology（佐治亚理工学院）

AI总结提出MM-BizRAG框架，通过文档结构感知分割和布局感知解析，结合统一LLM驱动的工件转换与推理时多模态组装，无需微调即可提升企业文档问答性能，在异构企业数据集和两个公开基准上超越基线最多32个百分点。

Comments Accepted at ACL 2026 (Industry Track)

详情

AI中文摘要

近期多模态检索增强生成（MM-RAG）的进展倾向于最小化解析，依赖页面级图像来生成检索器嵌入和答案生成。虽然高效，但这种趋势往往忽略了对复杂企业文档中丰富结构化信息的显式处理，而是依赖预训练嵌入或视觉语言模型隐式捕获这种结构。在本工作中，我们采取更直接的方法：MM-BizRAG通过文档结构感知分割主动提取和表示文档结构，该分割根据文档方向动态路由文档至特定方向的摄取管道，对垂直结构文档（如报告）应用显式布局感知解析，对水平结构文档（如幻灯片）应用整体页面级表示。统一的LLM驱动的工件转换管道通过基于占位符的位置对齐保留自然阅读顺序，而推理时的多模态组装将检索表示与生成上下文解耦，无需任何微调即可生成更丰富、更基于事实的答案。通过在大型异构企业数据集和两个公开基准（SlideVQA和FinRAGBench-V）上的实验，MM-BizRAG一致地超越最先进的以视觉为中心的基线最多32个百分点，在报告式布局上尤其强劲。此外，我们引入了FastRAGEval，一种单次调用的LLM评判指标，用于细粒度生成召回，将RAGChecker的成本减半，同时实现更强的人类对齐。

英文摘要

Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.

2606.04205 2026-06-04 cs.MM cs.AI cs.CL cs.CV cs.LG cs.SD 版本更新

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

DetectZoo：一个用于跨文本、音频和图像模态的AI生成内容检测的统一工具包

Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

发表机构 * University of Toronto（多伦多大学）； University of Waterloo（滑铁卢大学）； Toronto Metropolitan University（多伦多都会大学）； University of British Columbia（不列颠哥伦比亚大学）； Vector Institute（向量研究所）

AI总结提出DetectZoo，一个首个统一的多模态AI生成内容检测工具包，通过标准化数据预处理、评估流程和集成61个检测器与22个基准数据集，实现公平可重复的基准测试。

详情

AI中文摘要

生成模型的日益普及和能力提升模糊了人类与机器生成内容之间的界限，推动了跨文本、图像和音频检测领域的大量研究。大多数现有的检测器要么是商业软件，要么是开源但带有不兼容的代码库、定制化的预处理、评估协议和评估指标，这使得它们的采用、公平比较和复现变得相当困难。为了解决这一关键差距，我们引入了DetectZoo，这是首个可扩展的工具包，旨在为跨文本、音频和图像模态的AI生成内容检测提供统一接口。DetectZoo标准化了从数据摄取和预处理到模型评估的完整实证流程，为研究人员提供了一个统一的框架来系统地基准测试最先进的检测器。通过将多样的公共数据集和基线检测算法集成到单一的统一API下，我们的工具包促进了严格且可重复的评估。DetectZoo提供了61个检测器的参考实现、22个基准数据集的原生加载器，以及一个标准化的评估流程，通过通用接口报告多个指标。每个检测器都是自包含的，但可通过同一接口访问，自动缓存预训练权重，并复现原始发表的结果。DetectZoo降低了多模态AI取证的入门门槛，使研究人员能够识别跨领域的性能差距，并加速开发鲁棒、可泛化的检测技术。开源仓库和全面文档可在https://github.com/sadjadeb/DetectZoo 获取，且可通过pip install detectzoo安装该包。

英文摘要

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2606.04199 2026-06-04 cs.CL cs.LG 版本更新

SocialCoach: 基于强化学习的智能辅导与练习的个性化社交技能学习

Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang

发表机构 * HKUST (GZ)（香港科技大学（广州））； Duke University（杜克大学）； MSRA Beijing（微软研究院北京）； Microsoft Beijing（微软北京）

AI总结提出SocialCoach系统，利用多智能体管道构建知识语料库、强化学习优化自适应练习调度，并结合沉浸式实践与反思辅导，以解决社交技能学习中专家辅导稀缺和知行差距问题。

详情

AI中文摘要

社交技能如谈判和领导力在当今互联世界中对于个人和职业成功至关重要。然而，由于专家辅导的稀缺，可扩展且有效的培训仍然是一个重大挑战。在本文中，我们介绍了SocialCoach，一个全面的LLM驱动的智能辅导系统，用于大规模个性化社交技能发展。首先，SocialCoach利用多智能体管道，从多样化的专家来源自动构建一个基于教学法的理论到实践知识语料库。其次，为了个性化学习旅程，它采用了一个自适应练习调度模块，遵循处方-检索-适应过程。为了在克服冷启动问题的同时最大化长期学习体验，该策略通过强化学习在学习者模拟环境中进行优化。最后，SocialCoach整合了沉浸式目标驱动练习、因果驱动能力评估和基于知识的反思辅导，以帮助解决知行差距。我们在产品EQoach中部署了该系统，并进行了广泛实验。结果表明，SocialCoach在模拟路径质量和评委评估的辅导质量上优于基线方法，而早期用户反馈表明其具有强烈的感知参与度和有用性。这些发现为个性化、游戏化的软技能学习教学平台提供了一种实用架构。

英文摘要

Social skills such as negotiation and leadership are crucial for personal and professional success in today's interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM-powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically-grounded, theory-to-practice knowledge corpus from diverse expert sources, leveraging a multi-agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription-retrieval-adaptation process. To maximize the long-term learning experience while overcoming the cold-start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal-driven practice, causality-driven proficiency assessment and knowledge-grounded, reflective tutoring to help address the knowing-doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge-rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.

2606.04127 2026-06-04 cs.CL 版本更新

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

当检索无济于事：生物医学RAG的大规模研究

Erfan Nourbakhsh, Rocky Slavin, Ke Yang, Anthony Rios

发表机构 * The University of Texas at San Antonio（德克萨斯大学圣安东尼奥分校）

AI总结本研究通过大规模实验发现，检索增强生成（RAG）在生物医学问答中仅带来微小且不一致的提升（1-2%），主要瓶颈在于模型有效利用检索证据的能力不足。

Comments 9 Pages, accepted to BioNLP Workshop at ACL 2026

详情

AI中文摘要

大型语言模型攻击奖励机制与社会

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

发表机构 * King’s College London（伦敦大学国王学院）； Fudan University（复旦大学）； The Alan Turing Institute（艾伦·图灵研究所）

AI总结研究强化学习训练中大型语言模型利用奖励函数漏洞的“社会攻击”现象，通过SocioHack沙盒实验发现模型能发现并利用社会规则漏洞，且现有安全措施效果有限。

Comments 14 pages, 9 figures, 7 tables

详情

AI中文摘要

强化学习已成为一种主导的后训练范式，使大型语言模型能够从奖励中学习。我们观察到社会规则在结构上与奖励函数相似。它们定义了可衡量的结果、阈值和例外情况，同时往往仅部分指定了制度意图。我们假设强化学习训练过程可能利用这些漏洞，因此提出模型在强化学习期间攻击奖励函数的已知倾向是否可能扩展为一种更严重的失败模式，即社会攻击：发现社会运行规则中的漏洞。为了研究这一现象，我们引入了SocioHack，一个包含72个社会环境的沙盒，并发现这些环境中奖励攻击自然出现并导致监管漏洞的发现。模型学会攻击社会规则并生成技术上合规但违背监管意图的策略，而当前的大型语言模型安全措施仅提供有限的缓解。因此，收集真实世界反馈用于模型训练需要更加谨慎，我们需要下一代后训练范式来安全地在真实社会中迭代大型语言模型。

英文摘要

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.

2606.04071 2026-06-04 cs.CR cs.CL cs.LG 版本更新

Covert Influence Between Language Models

语言模型之间的隐蔽影响

Avidan Shah, Jay Chooi, Jinghua Ou, Shi Feng

发表机构 * MATS ； New York University（纽约大学）； Harvard University（哈佛大学）； George Washington University（乔治华盛顿大学）

AI总结本文研究语言模型间通过微调、蒸馏和上下文学习三种接口实现隐蔽影响的风险，并提出使用逐点归因分数选择载体以放大训练时影响，发现数字载体相比自然语言载体更难被人类检测且跨模型迁移性更差。

详情

AI中文摘要

随着语言模型越来越多地消费彼此的输出，隐蔽影响——即发送者的载荷（其被条件化传播的行为倾向）通过人类无法检测的载体转移到接收者的现象——成为一种日益增长的风险。我们通过三种接口（监督微调、在线策略蒸馏和上下文学习）刻画了这一风险，并发现它们在实现不留下人类可见痕迹的影响规模上有所不同。利用推理时逐样本归因分数，我们研究了所有三种接口下的隐蔽影响，并具备选择能够放大训练时影响的载体的能力，解锁了先前工作无法实现的载荷转移。我们进一步提供证据表明，使用自然语言载体的隐蔽影响与先前使用数字载体的研究是不同的现象，因为后者更难以被人类检测且跨模型家族的迁移性更差。这些结果共同表明，隐蔽影响的风险面比先前认识到的更广，我们研究了逐点归因评分方法作为调查和缓解该风险的工具。

英文摘要

As language models increasingly consume one another's outputs, covert influence -- a phenomenon where a sender's payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers undetectable by humans -- becomes a growing risk. We characterize this risk across three interfaces: supervised fine-tuning, on-policy distillation, and in-context learning, and find that they vary in the scale of influence achievable without leaving behind human-visible traces. Using inference-time per-sample attribution scores, we study covert influence across all three interfaces with the ability to select carriers that amplify training-time influence, unlocking payload transfers that prior work could not achieve. We further provide evidence that covert influence with natural-language carriers is a distinct phenomenon from prior studies using number carriers, as the latter is more resistant to human detection and less portable across model families. Together, these results suggest that the risk surface for covert influence is broader than previously recognized, and we study pointwise attribution scoring methods as a tool to investigate and mitigate it.

2606.04046 2026-06-04 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

深入场景：通过焦点计划生成打破视觉-语言决策中的感知瓶颈

Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

AI总结提出SceneDiver方法，通过从粗到细的焦点计划生成，逐步构建场景图并分解任务，减少视觉幻觉，提升视觉-语言模型和视觉-语言-动作模型在具身决策任务中的表现。

Comments Accepted at ICML 2026

详情

AI中文摘要

在具身视觉-语言决策任务（如机器人操作和导航）中，视觉-语言模型和视觉-语言-动作模型（VLMs & VLAs）是具有不同优势的强大工具：VLMs更擅长长期规划，而VLAs更擅长反应控制。然而，它们的性能受到相同感知瓶颈的限制：由于模型无法区分任务相关对象与干扰物，导致视觉幻觉。原则上，准确识别并聚焦关键对象同时过滤无关对象是突破这一限制的关键。一个直接的解决方案是一步聚焦：直接关注重要对象。然而，这种方法被证明无效，因为有效的聚焦本质上需要深度场景理解。为此，我们提出SceneDiver，一种利用VLMs长期规划能力的从粗到细的焦点计划生成方法，首先构建整体场景图以建立初步理解，然后通过识别、理解和分析的迭代循环逐步将任务分解为更简单的子问题。为了实现反应控制，我们还设计了一个轻量级适配器，将深思熟虑的聚焦能力蒸馏到VLAs中。在标准具身AI基准上的评估证实，我们的方法显著减少了VLMs和VLAs的视觉幻觉，同时在需要快速执行的任务中保持了计算效率。我们的代码和数据发布在：https://future-item.github.io/SceneDiver。

英文摘要

In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: https://future-item.github.io/SceneDiver.


2606.03892 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

合成与奖励——面向实时环境中多步骤工具使用的强化学习

Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi

发表机构 * IBM Research（IBM研究院）

AI总结提出PROVE框架，通过20个有状态MCP服务器、自动化数据合成流水线和多组件程序化奖励，解决多步骤工具调用中的环境构建、查询生成和奖励设计问题，在BFCL Multi-Turn、tau2-bench和T-Eval上分别提升最多+10.2、+6.8和+6.5分。

详情

AI中文摘要

训练LLM编排多步骤工具调用受到三个相互耦合的障碍的阻碍：现实的有状态执行环境构建成本高昂，合成训练查询通常与服务器的实际状态脱节（因此生成的工具调用无法执行），以及基于回忆的RL奖励会鼓励冗长的工具调用模式。我们提出PROVE（已验证环境上的程序化奖励），一个包含三项贡献的框架：（1）一个包含20个有状态MCP（模型上下文协议）服务器的库，暴露了343个工具，支持具有会话范围状态隔离的实时执行RL训练；（2）一个自动数据合成流水线，通过基于实时采样服务器状态的依赖图引导的对话模拟，针对这些服务器生成经过验证的多轮工具调用轨迹，使得每个生成的查询都引用实际存在的实体；（3）一个多组件程序化奖励——渐进式有效性评分、依赖感知覆盖率、具有复杂度缩放调用预算的自适应效率惩罚、工具名称信号和参数值匹配奖励——无需外部评判模型。我们使用相同的奖励超参数和约13K训练示例，通过GRPO训练了四个模型（Qwen3-4B、Qwen3-8B、Qwen2.5-7B、Granite-4.1-8B）；仅对每个模型族从三点扫描中调整学习率。在BFCL Multi-Turn、tau2-bench和T-Eval上，PROVE分别带来了最多+10.2、+6.8和+6.5分的改进，表明紧凑的程序化奖励在两个模型族的多步骤工具编排上产生了一致的收益。

英文摘要

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) a state-machine data synthesis pipeline that generates multi-turn tool-call trajectories grounded in live-sampled server state, so generated queries reference entities that actually exist; and (3) a multi-component programmatic reward with an adaptive efficiency penalty that counters the verbosity incentive of recall-based rewards. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO on the resulting ~13K training examples. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that this framework yields consistent gains on multi-step tool orchestration across two model families.
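To make the efficiency component concrete, here is a minimal Python sketch of one way an efficiency penalty with a complexity-scaled call budget could work. The abstract does not give the formula, so the function name `efficiency_penalty`, the `slack` multiplier, and the linear `rate` are all assumptions chosen for illustration, not the PROVE implementation.

```python
import math


def efficiency_penalty(calls_made: int, required_calls: int,
                       slack: float = 1.5, rate: float = 0.1) -> float:
    """Penalize tool calls beyond a budget that grows with task complexity.

    Hypothetical form: the budget is `slack` times the number of calls the
    task needs, and each call over budget costs `rate`.
    """
    budget = math.ceil(slack * required_calls)
    excess = max(0, calls_made - budget)
    return rate * excess


# A simple task (2 required calls) solved with 8 calls is penalized,
# while a complex task (6 required calls) solved with 8 calls is not.
simple_task_penalty = efficiency_penalty(calls_made=8, required_calls=2)
complex_task_penalty = efficiency_penalty(calls_made=8, required_calls=6)
print(simple_task_penalty, complex_task_penalty)
```

Scaling the budget with the number of required calls is what keeps such a penalty from punishing legitimately long trajectories while still discouraging verbose ones.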


2606.03810 2026-06-04 cs.CL cs.AI 版本更新

Consistency Training Can Entrench Misalignment

一致性训练可能固化不对齐

David Demitri Africa, Arathi Mani

发表机构 * UK AI Security Institute（英国人工智能安全研究所）

AI总结研究通过七种一致性训练方法在108个微调模型上的实验，发现一致性训练通常抑制奖励黑客和新兴不对齐，但会放大谄媚行为，并提出了一个统一的理论框架来解释其对齐效应。

Comments Accepted to ICML 2026

详情

AI中文摘要

一致性训练鼓励模型在相关输入或采样过程中产生相似输出。这类方法简单、可扩展且基本无需标签，但其对模型对齐的影响仍知之甚少。这些方法的自引导特性是否会放大模型中的不良行为？我们在108个“模型生物体”（经过微调以展示各种受控不对齐行为的开源模型，7B-70B）上测试了七种一致性训练方法。我们发现结果差异显著：一致性训练通常抑制奖励黑客和新兴不对齐，但会放大谄媚行为。我们提供的证据表明，由一致性标注过程引起的分布偏移（而非选择算子的变化）可能是系统性对齐效应的主要驱动因素。最后，我们提出了一个统一的理论框架，推导出一致性训练放大或抑制不对齐的条件。总之，我们的研究确立了一致性训练并非对齐中立的，其在关键系统中的使用应受到仔细审计。

英文摘要

Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 model organisms: open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.


2606.03376 2026-06-04 cs.CV cs.AI cs.CL cs.LG 版本更新

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

P²-DPO：通过校准直接偏好优化在感知处理中锚定幻觉

Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

发表机构 * Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence, School of Computer Science & Engineering, South China University of Technology（广东省计算人工智能模型与认知智能重点实验室，计算机科学与工程学院，华南理工大学）； Pazhou Lab, Guangzhou, China（琶洲实验室，广州，中国）； Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human, Guangzhou, China（教育部健康智能感知与并行数字人工程研究中心，广州，中国）

AI总结针对大型视觉语言模型中的幻觉问题，提出P²-DPO训练范式，通过模型自生成偏好对和校准损失，直接优化感知瓶颈和视觉鲁棒性，无需昂贵人工反馈。

详情

AI中文摘要

幻觉最近在大型视觉语言模型（LVLMs）中引起了广泛的研究关注。直接偏好优化（DPO）旨在直接从人类提供的纠正偏好中学习，从而解决幻觉问题。尽管取得了成功，但这种范式尚未专门针对关注区域中的感知瓶颈或解决图像退化下的视觉鲁棒性不足问题。此外，现有的偏好对通常是视觉无关的，其固有的离策略性质限制了它们在指导模型学习方面的有效性。为了解决这些挑战，我们提出了感知处理直接偏好优化（P²-DPO），一种新颖的训练范式，其中模型生成并学习自己的偏好对，从而直接解决已识别的视觉瓶颈，同时固有地避免视觉无关和离策略数据的问题。它引入了：（1）一种针对焦点增强感知和视觉鲁棒性的在策略偏好对构建方法，以及（2）一种精心设计的校准损失，以精确地将视觉信号与文本的因果生成对齐。实验结果表明，在相当数量的训练数据和成本下，P²-DPO在基准测试中优于依赖昂贵人工反馈的强基线。此外，对注意力区域保真度（ARF）和图像退化场景的评估验证了P²-DPO在解决关注区域感知瓶颈和提高对退化输入的视觉鲁棒性方面的有效性。

英文摘要

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.


2606.03318 2026-06-04 cs.CL 版本更新

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

超越理想指令：现实交互中评估大语言模型的综合框架

Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao

发表机构 * Department of Data Science, City University of Hong Kong（香港城市大学数据科学系）； Hong Kong Institute of AI for Science, City University of Hong Kong（香港城市大学人工智能科学研究院）； Li Auto Inc.（理想汽车）； Beijing University of Posts and Telecommunications（北京邮电大学）； Independent Researcher（独立研究员）

AI总结提出RUT-Bench基准，通过高保真模拟理想与非理想用户行为，评估大语言模型在现实工具调用场景中的表现，发现所有测试模型成功率低于40%且面对复杂非理想输入时性能显著下降。

详情

AI中文摘要

尽管大语言模型（LLMs）在工具使用能力上取得了巨大进步，但现有的评估基准难以完全对齐真实世界场景。这些基准大多依赖于模拟的理想化用户假设，缺乏面向经验的评估。这些局限性未能考虑到真实用户特有的模糊性、不合作行为和意图转变。为填补这一空白，我们提出了RUT-Bench，一个专门用于评估LLMs在多样化真实用户工具调用场景下的基准。RUT-Bench支持高保真模拟，涵盖单轮和多轮对话中的理想理性模式和非理想异质行为。我们使用该基准对19个广泛采用的开源和专有LLMs进行了全面评估。实验结果显示，所有被测LLM的整体成功率均未超过40%，并且几乎所有模型在面对更复杂的非理想用户输入时都出现了明显的性能下降。我们的代码和数据可在https://github.com/Miaow-Lab/RUT-Bench获取。

英文摘要

Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions and lack experience-oriented evaluation. These limitations fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations on 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/Miaow-Lab/RUT-Bench.


2606.03110 2026-06-04 cs.CL 版本更新

Coherence Maximization Improves Pluralistic Alignment

一致性最大化改进多元对齐

Taslim Mahbub, Yiding Pei, Shi Feng

发表机构 * George Washington University（乔治·华盛顿大学）

AI总结提出内部一致性最大化（ICM）方法，通过最大化标签的互可预测性生成个性化示例，无需人工监督即可将模型与目标群体价值观对齐，并证明示例一致性比单独准确性更重要。

详情

AI中文摘要

将AI系统与多样化的人类价值观对齐需要基于具体示例的价值规范，但在没有广泛人工监督的情况下生成此类示例仍然是一个开放的挑战。我们研究了这些示例的有效性因素，使用内部一致性最大化（ICM）——通过最大化标签的互可预测性来推断标签——生成特定于人的示例，将模型引导至目标群体的价值观，无需人工监督。在涵盖分类、偏好和开放式生成的四个基准测试中，ICM推断的上下文示例与黄金标签的性能相匹配。至关重要的是，一致性比单独的标签准确性更重要：在准确性保持不变的情况下，更一致的示例比不一致的示例具有更好的泛化能力。对于预训练数据中代表性不足的人物，在模型对人物价值观最不确定的问题上进行有针对性的反馈，比在任意问题上使用相同数量的标签产生更好的泛化效果。这些结果将一致性确定为可扩展价值规范的关键设计原则，利用了预训练语言模型中已经编码的多样化人类视角。

英文摘要

Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these examples effective, using Internal Coherence Maximization (ICM) -- which infers labels by maximizing their mutual predictability -- to generate persona-specific examples that steer a model toward a target group's values, without human supervision. Across four benchmarks spanning classification, preference, and open-ended generation, ICM-inferred in-context examples match the performance of gold labels. Crucially, coherence matters beyond individual label accuracy: with accuracy held constant, more coherent examples generalize substantially better than incoherent ones. For personas underrepresented in pretraining data, targeted human feedback on the questions where the model is least certain about a persona's values yields better generalization than the same number of labels on arbitrary questions. These results identify coherence as a key design principle for scalable value specification, leveraging the diverse human perspectives already encoded in pretrained language models.


2606.02914 2026-06-04 cs.AI cs.CL 版本更新

DiscourseFlip: 面向黑盒检索增强生成的非直述式语篇级观点操纵攻击

Yuyang Gong, Miaokun Chen, Jiawei Liu, Zhuo Chen, Guoxiu He, Wei Lu, XiaoFeng Wang, Xiaozhong Liu

发表机构 * Wuhan University（武汉大学）； East China Normal University（华东师范大学）； Nanyang Technological University（南洋理工大学）； Worcester Polytechnic Institute（伍斯特理工学院）

AI总结提出一种基于图引导的代理攻击方法DiscourseFlip，通过语义查询网络中的协同影响在有限预算下最大化语篇级观点偏差，实验证明其有效性和隐蔽性，并揭示现有防御的不足。

详情

AI中文摘要

检索增强生成（RAG）系统被广泛部署且影响力日益增强，但其对外部语料库的依赖暴露了来自中毒检索内容的新安全风险。现有的RAG攻击主要关注单个查询或狭窄主题局部查询集，这限制了其实际影响范围，并在现实场景中提供有限的伪装。在本文中，我们引入了语篇级观点操纵，这是一种新的威胁模型，其中跨语义查询网络的协同影响会在整体、多主题查询空间上诱导观点转变。我们在黑盒设置中形式化了这种威胁，并提出了DiscourseFlip，一种基于代理的、图引导的攻击，动态分配有限的中毒预算以最大化语篇级观点偏差。大量实验表明，DiscourseFlip在上下文化查询网络上持续诱导目标观点转变，并在覆盖范围和有效性方面显著优于现有基线。用户研究进一步证实，DiscourseFlip有效且能很好地伪装以躲避用户检测。此外，系统分析表明，现有的缓解策略对语篇级操纵无效，这凸显了迫切需要更鲁棒和自适应的防御措施来应对语篇级漏洞。

英文摘要

Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora exposes new security risks from poisoned retrieval content. Existing RAG attacks largely focus on individual queries or narrow topic-local query sets, which limits their practical reach and offers limited camouflage in real-world settings. In this paper, we introduce discourse-level opinion manipulation, a new threat model in which coordinated influence across a semantic query network induces opinion shifts over a holistic, multi-topic query space. We formalize this threat in a black-box setting and propose DiscourseFlip, an agentic, graph-guided attack that dynamically allocates a limited poisoning budget to maximize discourse-level opinion deviation. Extensive experiments demonstrate that DiscourseFlip consistently induces targeted opinion shifts across the contextualized query network and significantly outperforms existing baselines in terms of coverage and effectiveness. User studies further confirm that DiscourseFlip is effective while remaining well camouflaged from user detection. Moreover, systematic analyses show that existing mitigation strategies are ineffective against discourse-level manipulation, underscoring the urgent need for more robust and adaptive defenses to address discourse-level vulnerabilities.


2606.00356 2026-06-04 cs.CL 版本更新

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

自动解释标签的泛化程度：跨语言、文字和改写的受控研究

Sripad Karne

发表机构 * Columbia University（哥伦比亚大学）

AI总结通过塞尔维亚双文字系统控制实验，研究稀疏自编码器特征的自然语言标签是否真正泛化到不同语言和文字，发现标签在语义内容匹配上存在显著偏差，且随网络深度增加而加剧。

详情

AI中文摘要

稀疏自编码器（SAE）特征越来越多地用于解释语言模型，自动生成的自然语言标签是理解每个特征含义的主要接口。我们询问这些标签是否泛化：标记为某个概念的特征是否真的跨语言和文字追踪该概念？我们使用塞尔维亚语的双文字系统作为受控测试平台（同一语言可通过确定性音译以拉丁字母和西里尔字母书写），首先发现，由不同语言、文字和措辞中的相同内容激活的SAE特征集具有显著重叠（平均Jaccard相似度0.39，随机基线为0.13，峰值达0.57），表明存在真正的跨语言语义特征。然后我们测试自动解释标签是否同样泛化。它们往往做不到：标签描述语义内容的特征在塞尔维亚语中错过相同含义的频率最多是英语内部的4倍，并且在塞尔维亚西里尔字母上的遗漏多于塞尔维亚拉丁字母（这两种文字是彼此的确定性音译），表明这些失败与每种形式在训练数据中的代表程度一致。差距随着网络深度增加而扩大，但标签本身没有给出任何失败迹象。这些结果表明，自动解释标签反映的可能是特征在代表性充分的输入上的行为，而不是概念本身。

英文摘要

Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed--the same language written in both Latin and Cyrillic via deterministic transliteration--we first find that SAE feature sets activated by the same content in different languages, scripts, and wordings share substantial overlap (mean Jaccard 0.39 vs. 0.13 random baseline, peaking at 0.57), suggesting genuine cross-lingual semantic features. We then test whether auto-interpretation labels keep pace. They often do not: features whose labels describe semantic content miss the same meaning in Serbian up to 4x more often than within English, and miss Serbian Cyrillic more than Serbian Latin--two scripts that are deterministic transliterations of each other--suggesting the failures align with how well each form is represented in training. The gap grows with network depth, yet the labels give no indication that they fail. These results suggest that auto-interpretation labels may reflect a feature's behavior on well-represented inputs rather than the concept itself.
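For readers unfamiliar with the overlap measure quoted above, the following Python sketch shows how a Jaccard similarity between two sets of activated feature IDs is computed. The feature IDs are made up for illustration; only the Jaccard formula itself is standard.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical SAE feature IDs activated by the same sentence in two scripts.
latin_features = {3, 17, 42, 101, 256}
cyrillic_features = {3, 17, 42, 77}

overlap = jaccard(latin_features, cyrillic_features)
print(f"{overlap:.2f}")  # 3 shared features out of 6 distinct ones
```

A value near the random baseline would mean the two scripts activate essentially unrelated features, while a value near 1 would mean nearly identical feature sets.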


2606.00012 2026-06-04 cs.CL cs.AI 版本更新

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset

DraDDP：多模态多方对话话语解析数据集

Shannan Liu, Peifeng Li, Yaxin Fan, Qiaoming Zhu

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）

AI总结针对现有研究局限于文本或双方对话的问题，构建了基于美剧的首个公开英文多模态多方对话话语解析数据集DraDDP，并验证了多模态信息在捕捉对话结构和关系类型中的价值。

详情

Journal ref: Findings of the Association for Computational Linguistics (ACL 2026)

AI中文摘要

多方对话话语解析旨在识别对话中话语之间的依赖结构和关系类型。以往的研究大多局限于文本模态或双方对话，无法满足多模态和多方对话场景。本文基于美国电视剧，构建了首个公开的英文多模态多方对话话语解析数据集DraDDP。该数据集包含495个对话片段，共6,374条话语和9.1小时的并行视频内容，涵盖了丰富的多方交互场景。此外，我们在DraDDP上评估了该任务，并深入分析了不同模态的影响，建立了全面的基准。实验结果表明，多模态信息在捕捉对话结构和关系类型方面具有重要价值。我们将公开发布数据集、标注指南和代码，以促进多模态对话理解的未来研究。

英文摘要

Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to textual modality or two-party dialogue, failing to meet the multimodal and multi-party settings. In this paper, we construct the first publicly available English multimodal dataset DraDDP for multi-party dialogue discourse parsing, based on American TV dramas. DraDDP contains 495 dialogue segments with 6,374 utterances and 9.1 hours of parallel video content, covering rich multi-party interaction scenarios. Moreover, we establish comprehensive benchmarks by evaluating this task on DraDDP and conducting in-depth analysis on the impact of different modalities. Experimental results demonstrate the value of multimodal information in capturing dialogue structures and relation types. We will publicly release the dataset, annotation guidelines, and code to promote future research in multimodal dialogue understanding.


2605.31483 2026-06-04 cs.CL cs.AI 版本更新

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

BenHalluEval：孟加拉语大语言模型的多任务幻觉评估框架

Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Ishmam Tashdeed, Md Taukir Azam Chowdhury

发表机构 * Department of Computer Science and Engineering, Islamic University of Technology（伊斯兰科技大学计算机科学与工程系）； Department of Computer Science and Engineering, University of California（加州大学计算机科学与工程系）

AI总结针对孟加拉语大语言模型幻觉评估的空白，提出BenHalluEval框架，涵盖四项任务，构建12000个幻觉候选，并提出双轨校准指标BenHalluScore，揭示模型间幻觉校准的显著差异。

Comments Preprint. Under review

详情

AI中文摘要

尽管孟加拉语是世界上使用人数第六多的语言，但此前尚无工作系统评估大语言模型（LLMs）在孟加拉语上的幻觉。我们提出了BenHalluEval，一个针对孟加拉语的细粒度幻觉评估框架，涵盖四项任务：生成式问答（GQA）、孟加拉语-英语混合问答、摘要和推理。我们利用GPT-5.4从三个现有孟加拉语数据集中构建了12,000个幻觉候选，涵盖十二种任务特定的幻觉类型，并在双轨协议下评估了七个LLM，涵盖推理导向、多语言和孟加拉语中心类别，该协议独立测量真实实例上的假阳性率（轨道A）和幻觉候选上的幻觉检测率（轨道B）。为了同时惩罚两种失败模式并防止均匀响应偏差导致的分数膨胀，我们提出了BenHalluScore，一种双轨校准指标，在模型和任务上范围从7.72%到55.42%，揭示了幻觉校准的显著差异。链式思维提示作为一种缓解策略应用，会改变响应分布，但未能一致改善幻觉判别。BenHalluEval建立了首个针对孟加拉语的专用幻觉基准，并突显了单轨和仅提示评估方法在低资源语言环境中的不足。数据集和代码可在https://anonymous.4open.science/r/BanglaHalluEval-EB77获取。

英文摘要

Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
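The abstract does not state how BenHalluScore is computed, so the Python sketch below only illustrates why a dual-track metric resists uniform response bias. It assumes, purely for illustration, a harmonic mean of Track A specificity (1 minus the false-positive rate) and Track B detection rate; the function name `dual_track_score` is hypothetical.

```python
def dual_track_score(false_positive_rate: float, detection_rate: float) -> float:
    """Harmonic mean of specificity (Track A) and detection rate (Track B).

    Assumed form for illustration only; not the published BenHalluScore.
    """
    specificity = 1.0 - false_positive_rate
    if specificity + detection_rate == 0.0:
        return 0.0
    return 2.0 * specificity * detection_rate / (specificity + detection_rate)


# A model that flags everything as hallucinated detects every hallucination
# (Track B = 1.0) but also flags every correct answer (Track A FPR = 1.0).
always_flags = dual_track_score(false_positive_rate=1.0, detection_rate=1.0)
# A model that never flags anything has the opposite bias.
never_flags = dual_track_score(false_positive_rate=0.0, detection_rate=0.0)
# A model that actually discriminates scores well on both tracks.
discriminating = dual_track_score(false_positive_rate=0.2, detection_rate=0.7)
print(always_flags, never_flags, round(discriminating, 3))
```

Under a single-track evaluation that looks only at detection rate, the always-flagging model would look perfect; combining both tracks drives its score to zero.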


2605.30995 2026-06-04 cs.CY cs.CL 版本更新

Traceable by Design: An LLM Pipeline and Dashboard for EU Regulatory Consultation Analysis

设计即可追溯：用于欧盟监管咨询分析的LLM流程与仪表板

Thales Bertaglia, Haoyang Gui, Catalina Goanta, Gerasimos Spanakis

发表机构 * Utrecht University（乌特勒支大学）； Maastricht University（马斯特里赫特大学）

AI总结提出基于LLM的端到端流程与交互式仪表板，从监管咨询提交中提取结构化主题，确保逐字引用、完全可追溯和透明性，并以欧盟数字公平法案为例验证。

Comments This research has been supported by funding from the ERC Starting Grant HUMANads (ERC-2021-StG No 101041824)

详情

AI中文摘要

公众咨询产生大量利益相关者提交的数据，手动分析几乎不可行。我们提出了一个基于LLM的端到端流程和交互式仪表板，用于从监管咨询提交中提取结构化主题，并以欧盟委员会数字公平法案（DFA）公开征集证据作为案例研究。该系统处理原始PDF附件和网络表单响应，提取主题注释，并将每个提取结果基于源文本的逐字引用。应用于4,322份DFA提交，该流程生成了15,368个主题注释，并附有20,951条逐字证据引用。三个原则指导了所提出的设计：逐字引用、完全可追溯性和透明性设计。仪表板通过五个分析视图展示完整的提取数据集，从数据集级别的主题概览到单个段落的深入分析，每个结果都可追溯到其来源。除了预定义的DFA主题类别外，该流程还生成了某些利益相关者关注的问题，如年龄验证、支付处理器审查和数字所有权，这些是固定分类法方法会遗漏的。该流程是领域通用的；将其适应新的咨询只需要更新提示词和新的数据集。实时演示可在https://dfa-dashboard.thalesbertaglia.com/获取。代码和处理后的数据可在https://github.com/thalesbertaglia/dfa-dashboard公开获取。

英文摘要

Public consultations generate large volumes of data in the form of stakeholder submissions that are practically unfeasible to analyse manually. We present an end-to-end LLM-based pipeline and interactive dashboard for structured topic extraction from regulatory consultation submissions, demonstrated on the European Commission's Digital Fairness Act (DFA) public call for evidence as a case study. The system processes raw PDF attachments and web-form responses, extracts topic annotations, and grounds every extraction in a verbatim quote from the source text. Applied to 4,322 DFA submissions, the pipeline produced 15,368 topic annotations supported by 20,951 verbatim evidence quotes. Three principles govern the proposed design: verbatim grounding, full traceability, and transparency by design. The dashboard exposes the full extraction dataset through five analytical views, from dataset-level topic overviews to individual paragraph drill-downs, with every result traceable to its source. Beyond the predefined DFA topic categories, the pipeline generated certain stakeholder concerns, such as Age Verification, Payment Processor Censorship, and Digital Ownership, that a fixed-taxonomy approach would have missed. The pipeline is domain-generic; adapting it to a new consultation requires only a prompt update and a new dataset. A live demo is available at https://dfa-dashboard.thalesbertaglia.com/. The code and processed data are publicly available at https://github.com/thalesbertaglia/dfa-dashboard.
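The verbatim-grounding principle can be checked mechanically. The Python sketch below shows one plausible check: a quote counts as grounded only if it appears in the source text after whitespace normalization. The function names and the normalization rule are assumptions for illustration, not the authors' pipeline.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs so PDF line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def is_verbatim(quote: str, source: str) -> bool:
    """True if the non-empty quote occurs word for word in the source."""
    normalized_quote = normalize(quote)
    return bool(normalized_quote) and normalized_quote in normalize(source)


# Hypothetical submission text with line-break and spacing noise from PDF extraction.
source = "Dark patterns in\nsubscription flows   should be banned."

grounded = is_verbatim("subscription flows should be banned", source)
paraphrased = is_verbatim("subscriptions must be banned", source)
print(grounded, paraphrased)
```

Rejecting any extraction whose quote fails such a check is one simple way to keep every annotation traceable to its source.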


2605.30947 2026-06-04 cs.CL 版本更新

Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship

将面向科研的人工智能扩展到人文学科：一个用于基于证据的学术研究的多智能体框架

Yating Pan, Jiajun Zhang, Jun Wang, Qi Su

发表机构 * Department of Information Management（信息管理系）； Research Center for Digital Humanities（数字人文研究中心）； School of Foreign Languages（外国语言学院）； Institute for Artificial Intelligence（人工智能研究院）

AI总结提出SPIRE多智能体框架，通过将人文学科操作建模为协作智能体角色，结合多尺度细读检索，实现基于证据的论证，在古典文献基准上优于现有方法。

Comments 28 pages, 3 figures. Code, data catalogues, and reproduction scripts: https://github.com/YatingPan/SPIRE. Lead corresponding author: Jun Wang; corresponding author: Qi Su

详情

AI中文摘要

GAPD：面向知识库问答中智能体强化学习的金动作策略蒸馏

Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）； NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； ShanghaiTech University（上海科技大学）； Ant Group（蚂蚁集团）

AI总结提出GAPD框架，通过中间锚点匹配将金动作序列与在线策略对齐，为基于强化学习的知识库问答提供密集的令牌级指导，在多个基准上取得最优结果。

详情

AI中文摘要

强化学习（RL）天然适用于智能体知识库问答（KBQA），其中模型必须发出可执行动作、观察知识库反馈并最终返回答案。然而，当前基于RL的KBQA系统主要优化来自最终答案的稀疏奖励，导致中间动作错误监督不足。这对于逻辑形式标注的KBQA基准尤其受限：金逻辑形式可转换为可执行动作序列，但现有流水线主要将其用于热启动数据构建，而非用于在线策略RL更新。我们提出GAPD，一种训练时的金动作策略蒸馏框架，为基于结果的RL添加密集的令牌级指导。为了将金动作与在线学生策略对齐，GAPD使用中间锚点匹配：它将学生探索和金执行期间达到的中间实体视为状态锚点，并通过这些探索的实体集将学生状态与金状态匹配。基于对齐后的金动作的当前策略作为停止梯度的教师，其令牌分布被蒸馏回普通学生策略的生成动作令牌跨度上。GAPD在WebQSP、GrailQA和GraphQ上持续超越当前最先进水平。

英文摘要

Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.
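The anchor-matching idea can be sketched in a few lines of Python. This sketch assumes that each gold step and each student state is summarized by the set of entities reached so far, and that states are matched by highest Jaccard overlap. The overlap criterion, the function names, and the entity IDs are all hypothetical; the abstract only says that states are matched through explored entity sets.

```python
from typing import Optional


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two entity sets, 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_gold_state(student_entities: set[str],
                     gold_states: list[set[str]]) -> Optional[int]:
    """Index of the gold state most similar to the student state, or None if none overlaps."""
    best_index, best_score = None, 0.0
    for index, gold_entities in enumerate(gold_states):
        score = jaccard(student_entities, gold_entities)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical entity sets reached after each step of the gold action sequence.
gold_states = [{"e1"}, {"e1", "e2"}, {"e1", "e2", "e3"}]

aligned_step = match_gold_state({"e2", "e1"}, gold_states)
unmatched = match_gold_state({"e9"}, gold_states)
print(aligned_step, unmatched)
```

Once a student state is aligned to a gold step, the gold action at that step can serve as the conditioning signal for the teacher distribution described in the abstract.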


2509.23694 2026-06-04 cs.AI cs.CL cs.CR 版本更新

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

SafeSearch: 基于LLM的搜索代理的自动化红队测试

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出SafeSearch自动化红队框架，系统评估基于LLM的搜索代理在五个风险类别中的安全性，发现GPT-4.1-mini在搜索工作流中攻击成功率高达90.5%，且常见防御措施效果有限。

Comments Accepted by ICML 2026

详情

AI中文摘要

搜索代理将LLM连接到互联网，使其能够访问更广泛和更新的信息。然而，这也引入了一个新的威胁面：不可靠的搜索结果可能误导代理产生不安全的输出。现实世界的事件和我们的两个野外观察表明，此类失败在实践中可能发生。为了系统地研究这一威胁，我们提出了SafeSearch，一个可扩展、成本效益高且轻量级的自动化红队框架，支持搜索代理的沙盒安全评估。利用该框架，我们生成了涵盖五个风险类别（例如，错误信息和提示注入）的300个测试用例，并评估了三个搜索代理框架在17个代表性LLM上的表现。我们的结果揭示了基于LLM的搜索代理存在重大漏洞，在搜索工作流设置中，GPT-4.1-mini的最高攻击成功率（ASR）达到90.5%。此外，我们发现常见的防御措施（如提醒提示）提供的保护有限。总体而言，SafeSearch提供了一种实用的方法来衡量和提高基于LLM的搜索代理的安全性。

GroupTravelBench: 多人群组旅行规划中LLM智能体的基准测试

Xiang Cheng, Yulan Hu, Lulu Zheng, Zheng Pan, Xin Li, Yong Liu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学高瓴人工智能学院）； AMAP, Alibaba Group（阿里巴巴集团高德地图）

AI总结提出GroupTravelBench基准，通过多用户多轮对话任务评估LLM智能体在偏好获取、冲突协调和公平规划三方面的能力。

Comments work in process

详情

AI中文摘要

旅行规划是评估LLM智能体规划与工具使用能力的现实任务。然而，现有基准通常只假设单一用户，从而回避了现实场景中最具挑战性的方面之一：智能体识别和解决多用户冲突的能力。为填补这一空白，我们引入了GroupTravelBench，这是首个针对多用户、多轮旅行规划的基准。基于真实用户画像、POI数据和票价数据，我们综合生成了650个任务，并将其分为三个难度等级。除了单用户行程规划所需的标准能力（如多步推理和工具使用）外，我们的基准进一步评估了旅行智能体所需的三项关键能力：（i）获取：主动进行多轮对话以收集每位用户的偏好；（ii）协调：通过妥协或分组策略解决用户间的冲突；以及（iii）规划：搜索能最大化整体群体效用同时保持公平性和可行性的旅行方案。为模拟现实中的对话式行程规划，同时确保可靠的工具使用和离线评估，我们构建了一个带有缓存真实工具数据的交互式沙箱环境。我们评估了多种LLM，发现即使是前沿模型在偏好覆盖率和群体公平性方面仍存在显著弱点。GroupTravelBench为推进LLM智能体在现实旅行规划中的研究提供了一个实用且可复现的基准。

英文摘要

Travel planning in the real world is overwhelmingly a group activity, yet existing LLM travel-planning benchmarks reduce it to a single user, where the field is approaching saturation. This single-user assumption sidesteps what makes group planning hard for an agent: discovering private preferences across multiple users, surfacing conflicts, and balancing utility against fairness. To bring the task back to its multi-user reality, we introduce GroupTravelBench, the first benchmark for multi-user, multi-turn travel planning. Built from real user profiles, POI data, and ticket prices, it comprises 650 tasks across three difficulty levels, each running in a synchronous group-chat sandbox with cached tool data for reproducible offline evaluation. Beyond the multi-step reasoning and tool use that single-user benchmarks already test, GroupTravelBench probes three group-specific capabilities: (i) elicitation of private preferences through multi-turn dialogue; (ii) coordination of inter-user conflicts via compromise or subgrouping; and (iii) planning that balances group utility against fairness. We pair this with a complementary evaluation framework combining rule-based outcome metrics and LLM-judge process metrics. Across a wide range of frontier models, even the strongest agents fall short on all four rule-based outcome metrics, with plan validity below 12%, suggesting that group-level outcome quality is a key open challenge for LLM travel-planning agents.
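To illustrate what balancing group utility against fairness can mean in practice, the Python sketch below scores a plan as mean per-user utility minus a weighted gap between the best-off and worst-off user. This scoring rule, the `fairness_weight` parameter, and the utility values are assumptions for illustration; the abstract does not define the benchmark's metrics.

```python
from statistics import mean


def group_score(utilities: list[float], fairness_weight: float = 0.5) -> float:
    """Mean utility minus a penalty on the spread between best-off and worst-off users."""
    if not utilities:
        raise ValueError("at least one user utility is required")
    unfairness = max(utilities) - min(utilities)
    return mean(utilities) - fairness_weight * unfairness


# Hypothetical per-user utilities for two candidate plans.
lopsided_plan = [0.9, 0.9, 0.1]   # higher average, one user largely ignored
balanced_plan = [0.6, 0.6, 0.6]   # slightly lower average, nobody left out

lopsided_score = group_score(lopsided_plan)
balanced_score = group_score(balanced_plan)
print(round(lopsided_score, 3), round(balanced_score, 3))
```

With any positive fairness weight above a small threshold, the balanced plan wins even though its average utility is lower, which is the trade-off a group planner has to navigate.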


2605.18879 2026-06-04 cs.LG cs.AI cs.CL 版本更新

谈话（不）廉价：LLM攻击的分类法与基准覆盖审计

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

发表机构 * Palo Alto Networks（帕洛阿尔托网络）

AI总结提出一个基于STRIDE的4×6目标×技术矩阵框架，用于审计LLM攻击基准的集体覆盖，发现现有基准仅覆盖最多25%的威胁面，且存在命名碎片化和评估空白。

详情

AI中文摘要

我们引入了一个可重用的框架，用于审计LLM攻击基准是否共同覆盖威胁面：一个基于STRIDE的4×6目标×技术矩阵，该矩阵由从932篇arXiv安全研究（2023-2026）中提取的507叶分类法（401个数据填充叶和106个威胁模型衍生叶）构建而成。该矩阵支持基准外部验证——审计集体覆盖而非单个基准的一致性。将其应用于六个公开基准，发现三个主要框架（HarmBench、InjecAgent、AgentDojo）占据非重叠的单元格，最多覆盖矩阵的25%，而整个STRIDE威胁类别（服务中断、模型内部）缺乏标准化评估，尽管这些类别中已发表的攻击通过没有基准测试的机制实现了46倍令牌放大和96%的攻击成功率。包含2521个独特攻击组的语料库进一步揭示了普遍的命名碎片化（单个攻击最多有29种表面形式）以及集中在安全与对齐绕过上的严重问题，这些结构属性在较小规模下不可见。分类法、攻击记录和覆盖映射作为可扩展工件发布；随着新基准的出现，它们可以映射到同一矩阵上，使社区能够跟踪评估差距是否正在缩小。

英文摘要

We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023-2026). The matrix enables benchmark-external validation: auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
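The coverage audit reduces to simple set arithmetic over matrix cells. The Python sketch below computes the fraction of a 4×6 matrix covered by the union of several benchmarks. The benchmark names and cell assignments are placeholders invented for illustration; they are not the paper's actual mapping of HarmBench, InjecAgent, or AgentDojo.

```python
TARGETS = 4      # rows of the Target × Technique matrix
TECHNIQUES = 6   # columns of the matrix

Cell = tuple[int, int]


def collective_coverage(benchmark_cells: dict[str, set[Cell]]) -> float:
    """Fraction of matrix cells touched by at least one benchmark."""
    covered: set[Cell] = set().union(*benchmark_cells.values())
    return len(covered) / (TARGETS * TECHNIQUES)


# Placeholder (target, technique) cells for three hypothetical benchmarks.
cells = {
    "bench_a": {(0, 0), (0, 1)},
    "bench_b": {(1, 2), (1, 3)},
    "bench_c": {(2, 4), (2, 5)},
}

coverage = collective_coverage(cells)
print(f"{coverage:.0%}")  # 6 of 24 cells
```

Because coverage is computed on the union, mapping a newly released benchmark onto the matrix only requires adding one more entry to the dictionary and recomputing.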


2605.08665 2026-06-04 cs.CL 版本更新

MedRedFlag：探究LLMs如何在真实健康沟通中纠正误解

Sraavya Sambara, Yuan Pu, Ayman Ali, Vishala Mishra, Lionel Wong, Monica Agrawal

发表机构 * Independent Researcher（独立研究者）； Duke University（杜克大学）； Stanford University（斯坦福大学）

AI总结本研究通过构建MedRedFlag数据集（1100+个来自Reddit的需纠正问题），系统比较了先进LLMs与临床医生的回应，发现LLMs常未能纠正问题中的错误前提，可能导致次优医疗决策，揭示了患者面向医疗AI系统的关键安全漏洞。

详情

AI中文摘要

来自患者的真实健康问题往往无意中嵌入了错误的假设或前提。在这种情况下，安全的医疗沟通通常涉及纠正：先指出隐含的误解，然后回应用户的潜在背景，而非原始问题。尽管大型语言模型（LLMs）越来越多地被普通用户用于医疗建议，但它们尚未针对这一关键能力进行测试。因此，在本工作中，我们研究了LLMs如何应对真实健康问题中嵌入的错误前提。我们开发了一个半自动化流程来整理MedRedFlag，这是一个包含1100多个来自Reddit的、需要纠正的问题的数据集。然后，我们系统地比较了最先进的LLMs与临床医生的回应。我们的分析显示，LLMs往往未能纠正有问题的提问，即使检测到了有问题的前提，并且提供的答案可能导致次优的医疗决策。我们的基准测试和结果揭示了LLMs在真实健康沟通条件下表现的新且重大的差距，突显了面向患者的医疗AI系统的关键安全问题。代码和数据集可在https://github.com/srsambara-1/MedRedFlag获取。

英文摘要

Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.


2511.20233 2026-06-04 cs.CL 版本更新

REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control

REFLEX: 通过裁决锚定风格控制实现自我精炼的可解释事实核查

Chuyi Kong, Wei Gao, Jing Ma, Hongzhan Lin, Yuxi Sun

发表机构 * Hong Kong Baptist University（香港浸会大学）； Singapore Management University（新加坡管理大学）

AI总结提出REFLEX方法，利用自我分歧的真实性信号构建引导向量，以裁决锚定风格控制实现自我精炼的事实核查，仅需465个样本即达最优性能。

详情

Journal ref: ACL 2026 Main Conference

AI中文摘要

社交媒体上假新闻的盛行要求自动化事实核查系统提供准确的裁决和忠实的解释。然而，现有基于大语言模型（LLM）的方法忽略了LLM生成解释中的欺骗性误导风格，导致不忠实的理由可能误导人类判断。它们严重依赖外部知识源，引入幻觉甚至高延迟，削弱了实时使用中至关重要的可靠性和响应性。为解决这些挑战，我们提出REason-guided Fact-checking with Latent EXplanations (REFLEX)，一种自我精炼范式，显式控制以裁决锚定的推理风格。REFLEX利用骨干模型及其微调变体之间的自我分歧真实性信号构建引导向量，自然地将事实与风格分离。在真实世界数据集上的实验表明，REFLEX在LLaMA系列模型下仅用465个自我精炼样本即达到最先进性能。此外，由于其可迁移性，REFLEX在野外数据上获得了高达7.54%的提升。我们的结果进一步证明，该方法有效缓解了忠实幻觉，从而引导模型在可解释事实核查中比先前工作获得更准确的裁决。

英文摘要

The prevalence of fake news on social media demands automated fact-checking systems to provide accurate verdicts with faithful explanations. However, existing large language model (LLM)-based approaches ignore deceptive misinformation styles in LLM-generated explanations, resulting in unfaithful rationales that can mislead human judgments. They rely heavily on external knowledge sources, introducing hallucinations and even high latency that undermine reliability and responsiveness, which is crucial for real-time use. To address these challenges, we propose REason-guided Fact-checking with Latent EXplanations (REFLEX), a self-refining paradigm that explicitly controls reasoning style anchored on verdict. REFLEX utilizes self-disagreement veracity signals between the backbone model and its fine-tuned variant to construct steering vectors, naturally disentangling fact from style. Experiments on the real-world dataset show REFLEX achieves state-of-the-art performance under LLaMA-series models with only 465 self-refined samples. Moreover, owing to its transferability, REFLEX yields up to a 7.54% gain on in-the-wild data. Our results further demonstrate that our method effectively mitigates faithful hallucination, thereby guiding the model toward more accurate verdicts than previous works in explainable fact-checking.


2604.17709 2026-06-04 cs.CL cs.DC 版本更新

DeInfer: Efficient Parallel Inferencing for Decomposed Large Language Models

DeInfer：分解式大语言模型的高效并行推理

You-Liang Huang, Xinhao Huang, Chengxi Liao, Zeyi Wen

发表机构 * Boston University（波士顿大学）

AI总结针对分解式大语言模型并行推理性能差的问题，提出DeInfer系统，通过多项优化实现高性能并行推理，实验证明其优越性。

Comments accepted by DAC'26, latest version fixes a minor mistake

2604.11510 2026-06-04 cs.CL cs.AI cs.LG 版本更新

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

策略分裂：通过双模式熵正则化激励大语言模型强化学习中的双模式探索

Jiashu Yao, Heyan Huang, Daiqing Wu, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology（北京理工大学）； Tsinghua University（清华大学）； Beihang University（北京航空航天大学）

AI总结提出Policy Split方法，将策略分裂为正常和高熵两种模式，通过协作双模式熵正则化在保持准确性的同时促进多样化探索，实验表明在通用和创造性任务上优于现有基线。

Comments preprint

2604.08564 2026-06-04 cs.CL cs.LG 版本更新

Attention-Based Sampler for Diffusion Language Models

基于注意力的扩散语言模型采样器

Yuyan Zhou, Kai Syun Hou, Weiyu Chen, James Kwok

发表机构 * Department of Computer Science and Engineering, The Hong Kong University of Science and Technology（香港科技大学计算机科学与工程系）

AI总结针对扩散语言模型采样中忽略全局序列结构的问题，提出基于注意力矩阵列和的采样顺序优化方法，实现无训练的高质量并行采样。

详情

AI中文摘要

自回归模型（ARMs）已在语言建模中建立了主导范式。然而，其严格的顺序采样范式对推理效率和建模灵活性施加了根本性限制。为解决这些限制，提出了基于扩散的大语言模型（dLLMs），提供了并行采样和灵活语言建模的潜力。尽管有这些优势，当前dLLMs的采样策略主要依赖于token级别的信息，未能考虑全局序列结构，往往产生次优结果。在本文中，我们从对数似然最大化的角度研究采样顺序选择问题。我们证明该问题是NP难的，并提出一种基于最优采样秩的近似方法，使目标在计算上可行。我们进一步证明，通过按注意力矩阵列和降序采样token可以优化该可行目标。这一发现为注意力引导采样提供了原则性依据，并提供了贪婪搜索的理论基础替代方案。我们将这一理论见解实例化为一种新的无训练采样算法，称为Attn-Sampler，并进一步提出动态注意力阈值以实现实际加速。在多个基准上的大量实验验证了我们方法的有效性，表明它在增强采样并行性的同时实现了更优的生成质量。

英文摘要

Auto-regressive models (ARMs) have established a dominant paradigm in language modeling. However, their strictly sequential sampling paradigm imposes fundamental constraints on both inference efficiency and modeling flexibility. To address these limitations, diffusion-based large language models (dLLMs) have been proposed, offering the potential for parallel sampling and flexible language modeling. Despite these advantages, current dLLMs sampling strategies rely primarily on token level information, which fails to account for global sequence structure and often yields suboptimal results. In this paper, we study the sampling order selection problem from the perspective of log-likelihood maximization. We show that this problem is NP-hard and propose an optimal sampling-rank-based approximation that makes the objective computationally tractable. We further prove that the tractable objective is optimized by sampling tokens in descending order of their attention-matrix column sums. This finding provides a principled justification for attention-guided sampling and offers a theoretically grounded alternative to greedy search. We instantiate this theoretical insight in a new training-free sampling algorithm, termed Attn-Sampler, and further propose dynamic attention thresholding for practical acceleration. Extensive experiments across multiple benchmarks validate the effectiveness of our proposed method, demonstrating that it achieves superior generation quality while enhancing the sampling parallelism.
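
The ordering rule stated in the abstract (sample tokens in descending order of their attention-matrix column sums) can be illustrated with a minimal sketch. It assumes a single square attention matrix `attn[i][j]`, where row i attends to column j, and a list of positions that are still masked. The function names, the fixed threshold `tau`, and the toy matrix are hypothetical and are not taken from the paper's implementation; in particular, the paper's dynamic attention thresholding may work differently.

```python
from typing import List, Sequence


def column_sums(attn: Sequence[Sequence[float]]) -> List[float]:
    """Sum each column of a square attention matrix (attention received per token)."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


def sampling_order(attn: Sequence[Sequence[float]], masked: Sequence[int]) -> List[int]:
    """Order masked positions by descending column sum; ties are broken by position."""
    sums = column_sums(attn)
    return sorted(masked, key=lambda j: (-sums[j], j))


def parallel_batch(attn: Sequence[Sequence[float]], masked: Sequence[int], tau: float) -> List[int]:
    """Assumed thresholding rule: unmask every position whose column sum is at least tau,
    falling back to the single best position so that each step makes progress."""
    sums = column_sums(attn)
    order = sampling_order(attn, masked)
    batch = [j for j in order if sums[j] >= tau]
    return batch or order[:1]


# Toy 4-token attention matrix (each row sums to 1).
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.4, 0.4, 0.1],
    [0.3, 0.3, 0.3, 0.1],
]
print(sampling_order(attn, masked=[0, 2, 3]))
print(parallel_batch(attn, masked=[0, 2, 3], tau=1.0))
```

In this toy case position 2 receives the most attention among the masked positions, so it is decoded first; a lower `tau` would unmask several positions in the same step.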


2604.04944 2026-06-04 cs.CL cs.AI 版本更新

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

包含思维：通过净化决策空间缓解偏好不稳定性

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau

发表机构 * School of Computing and Information Systems, The University of Melbourne（计算与信息系统学院，墨尔本大学）

AI总结提出包含思维（IoT）策略，通过逐步自过滤干扰选项来重构多选题，从而稳定模型偏好并提升推理性能。

详情

AI中文摘要

多项选择题（MCQ）被广泛用于评估大型语言模型（LLM）。然而，LLM 仍然容易受到似是而非的干扰项的影响。这常常将注意力转移到无关选项上，导致在正确和错误答案之间不稳定地摇摆。在本文中，我们提出包含思维（IoT），一种渐进式自过滤策略，旨在减轻这种认知负荷（即干扰项存在下模型偏好的不稳定性），并使模型更有效地关注合理答案。我们的方法仅使用合理的选项选择来重构 MCQ，为检查比较判断以及模型在扰动下内部推理的稳定性提供了一个受控环境。通过明确记录这一过滤过程，IoT 还增强了模型决策的透明度和可解释性。广泛的实证评估表明，IoT 在算术、常识推理和教育基准测试中显著提升了思维链性能，且计算开销极小。

英文摘要

Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model's internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model's decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.


2604.00819 2026-06-04 cs.CL cs.AI 版本更新

Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

情感纠缠与贝叶斯推理用于多维情感理解

Hemanth Kotaprolu, Kishan Maharaj, Raey Zhao, Abhijit Mishra, Pushpak Bhattacharyya

发表机构 * Indian Institute of Technology Bombay（印度理工学院孟买分校）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； IBM Research（IBM研究院）

AI总结提出基于Plutchik基本情绪理论的情感场景基准EmoScene，并利用情感共现统计的贝叶斯推理框架进行联合后验推理，提升多维情感理解的结构一致性。

Comments 19 pages in total, 10 Figures, 7 Tables

详情

AI中文摘要

理解自然语言中的情感本质上是一个多维推理问题，其中多个情感信号通过上下文、人际关系和情境线索相互作用。然而，大多数现有的情感理解基准依赖于短文本和预定义的情感标签，将这一过程简化为独立的标签预测，忽略了情感之间的结构化依赖关系。为了解决这一局限性，我们引入了情感场景（EmoScene），一个基于理论的基准，包含4,731个上下文丰富的场景，并用源自Plutchik基本情绪的8维情感向量进行标注。基于情感很少独立出现的观察，我们进一步提出了一个纠缠感知的贝叶斯推理框架，该框架结合情感共现统计，对情感向量进行联合后验推理。这种轻量级的后处理不需要任何参数更新，提高了预测的结构一致性，并在不增加额外成本的情况下，整体词汇准确率提升了2.24%。因此，EmoScene为研究多维情感理解和当前语言模型的局限性提供了一个具有挑战性的基准。

英文摘要

Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik's basic emotions. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing does not require any parameter updates, improves the structural consistency of predictions, and yields overall gains of 2.24% Lexical Accuracy without any additional cost. EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
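
The idea of joint posterior inference with co-occurrence statistics can be sketched as follows. This is an illustrative reading only: it assumes binary emotion vectors, a smoothed empirical prior over whole vectors, and per-emotion model probabilities treated as independent evidence. The names `cooccurrence_prior`, `joint_posterior`, and `map_vector`, the smoothing constant, and the three-emotion toy data are hypothetical; the paper's actual formulation may differ.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Vector = Tuple[int, ...]


def cooccurrence_prior(labels: Sequence[Vector], alpha: float = 1.0) -> Dict[Vector, float]:
    """Smoothed empirical prior over full binary emotion vectors (encodes co-occurrence)."""
    k = len(labels[0])
    counts = {v: alpha for v in product((0, 1), repeat=k)}
    for v in labels:
        counts[tuple(v)] += 1.0
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def joint_posterior(probs: Sequence[float], prior: Dict[Vector, float]) -> Dict[Vector, float]:
    """Posterior over vectors: prior(v) times the product of per-emotion model scores."""
    post = {}
    for v, p_v in prior.items():
        evidence = 1.0
        for p, bit in zip(probs, v):
            evidence *= p if bit else (1.0 - p)
        post[v] = p_v * evidence
    z = sum(post.values())
    return {v: s / z for v, s in post.items()}


def map_vector(probs: Sequence[float], prior: Dict[Vector, float]) -> Vector:
    """Most probable joint emotion vector under the posterior."""
    post = joint_posterior(probs, prior)
    return max(post, key=post.get)


# Toy data with three emotions (joy, trust, fear); joy and trust usually co-occur.
train = [(1, 1, 0)] * 6 + [(0, 0, 1)] * 3 + [(1, 0, 0)]
prior = cooccurrence_prior(train, alpha=0.1)
model_probs = [0.9, 0.45, 0.1]  # independent thresholding at 0.5 would give (1, 0, 0)
print(map_vector(model_probs, prior))
```

Because the prior favours joy and trust appearing together, the joint estimate switches the borderline trust prediction on, which is the kind of structural consistency the abstract describes. No model parameters are updated.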


2601.11214 2026-06-04 cs.CL 版本更新

T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning

T$^\star$：通过轨迹感知强化学习实现掩码扩散语言模型的渐进块缩放

Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu

发表机构 * Shanghai Academy of AI for Science（上海人工智能科学研究院）； Shanghai Innovation Institute（上海创新研究院）； Fudan University（复旦大学）； School of Mathematical Sciences（数学科学学院）； Shanghai Jiao Tong University（上海交通大学）； Carnegie Mellon University（卡内基梅隆大学）

AI总结提出T$^\star$方法，利用基于TraceRL的训练课程，在掩码扩散语言模型中渐进增大块大小，实现高并行解码且性能损失极小。

2510.21459 2026-06-04 cs.CR cs.CL cs.LG 版本更新

SBASH: a Framework for Designing and Evaluating RAG vs. Prompt-Tuned LLM Honeypots

SBASH：用于设计和评估RAG与提示调优的LLM蜜罐框架

Adetayo Adebimpe, Helmut Neukirchen, Thomas Welsh

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出SBASH框架，利用轻量级本地LLM和RAG技术构建蜜罐，通过多种指标评估RAG与提示调优对LLM蜜罐真实性和响应延迟的影响。

Comments to be published in: The 3rd International Conference on Foundation and Large Language Models (FLLM2025), IEEE, 2025

详情

DOI: 10.1109/FLLM67465.2025.11391242
Journal ref: 2025 3rd International Conference on Foundation and Large Language Models (FLLM), IEEE, 2025

AI中文摘要

蜜罐是用于收集有价值威胁情报或将攻击者从生产系统引开的诱饵系统。最大化攻击者参与度对其效用至关重要。然而，研究表明，上下文感知能力（例如响应新攻击类型、系统和攻击者代理的能力）对于提高参与度是必要的。大型语言模型（LLM）已被证明是提高上下文感知能力的一种方法，但面临若干挑战，包括响应时间的准确性和及时性、高运营成本以及由于云部署带来的数据保护问题。我们提出了基于系统的注意力外壳蜜罐（SBASH）框架，通过使用轻量级本地LLM来管理数据保护问题。我们研究了使用检索增强生成（RAG）支持的LLM和非RAG LLM处理Linux shell命令的情况，并使用多种不同指标（如响应时间差异、人类测试者的真实感、以及通过Levenshtein距离、SBert和BertScore计算的与真实系统的相似度）对其进行评估。我们表明，RAG提高了未调优模型的准确性，而通过系统提示（指示LLM像Linux系统一样响应）调优的模型在无RAG情况下达到了与未调优模型有RAG时相似的准确性，同时延迟略低。

英文摘要

Honeypots are decoy systems used for gathering valuable threat intelligence or diverting attackers away from production systems. Maximising attacker engagement is essential to their utility. However, research has highlighted that context-awareness, such as the ability to respond to new attack types, systems and attacker agents, is necessary to increase engagement. Large Language Models (LLMs) have been shown as one approach to increase context awareness but suffer from several challenges including accuracy and timeliness of responses, high operational costs and data-protection issues due to cloud deployment. We propose the System-Based Attention Shell Honeypot (SBASH) framework which manages data-protection issues through the use of lightweight local LLMs. We investigate the use of Retrieval Augmented Generation (RAG) supported LLMs and non-RAG LLMs for Linux shell commands and evaluate them using several different metrics such as response time differences, realism from human testers, and similarity to a real system calculated with Levenshtein distance, SBert, and BertScore. We show that RAG improves accuracy for untuned models, while models tuned via a system prompt that tells the LLM to respond like a Linux system achieve, without RAG, accuracy similar to that of untuned models with RAG, at slightly lower latency.
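
One of the similarity metrics named in the abstract, Levenshtein distance, can be sketched in a few lines. The normalisation in `similarity` and the sample command outputs are assumptions made for illustration; the abstract does not say how the distance is normalised or which commands were compared.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical outputs of `whoami` on a real system and on an LLM honeypot.
real_output = "root"
honeypot_output = "root\n"
print(levenshtein(real_output, honeypot_output))
print(similarity(real_output, honeypot_output))
```

A character-level metric like this rewards byte-exact shell output, which is why the abstract pairs it with embedding-based scores (SBert, BertScore) that tolerate harmless wording differences.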


2603.23841 2026-06-04 cs.CL cs.AI 版本更新

PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

PoliticsBench: 通过多轮角色扮演基准测试大型语言模型中的政治价值观

Rohan Khetan, Ashna Khetan

发表机构 * Northville High School, Northville, USA（美国诺斯维尔，诺斯维尔高中）； Department of Computer Science, Stanford University, Stanford, USA（斯坦福大学计算机科学系）

AI总结提出PoliticsBench，一个多阶段角色扮演基准，通过20个演化场景评估LLM的细粒度价值表达，发现场景提示比直接提问能引发更广泛和强烈的价值表达。

Comments 7 pages, 5 tables, 5 figures, 4 appendix pages. Accepted to the ICML 2026 Trustworthy AI for Good Workshop

详情

AI中文摘要

虽然大型语言模型（LLMs）越来越多地被用作主要信息来源，但其潜在的政治偏见可能影响其客观性。现有的LLM社会偏见基准主要评估人口统计刻板印象，而当衡量政治偏见时，是在粗略的层面上进行的，忽视了塑造社会政治推理的价值观。我们引入了PoliticsBench，一个用于评估LLM中细粒度价值表达的多阶段角色扮演基准。在20个演化场景中，模型在竞争压力下阐述权衡、表明立场并做出决策。在八个主流LLM上，我们表明，与直接的政治问题相比，基于场景的提示引发了更广泛和更强烈的价值表达，峰值交互阶段使强烈激活的价值维度数量增加了约0.75（共10个维度），相对于基线提示具有统计显著性（p < 0.05）。此外，在交互过程中，立场的承诺度增加，从初始阶段到决策阶段，在[0,5]量表上上升了约1.4分。虽然在后期交互阶段，响应对于场景释义的鲁棒性降低，但评判者间的一致性保持相对稳定。我们的结果表明，评估LLM的政治行为需要超越静态提示，转向更长的交互设置，以捕捉价值观如何在上下文中应用。

英文摘要

While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate demographic stereotypes, and when political bias is measured, it is done so at a coarse level, overlooking the values that shape sociopolitical reasoning. We introduce PoliticsBench, a multi-stage roleplay benchmark for evaluating fine-grained value expression in LLMs. Across twenty evolving scenarios, models articulate tradeoffs, take positions, and make decisions under competing pressures. Across eight prominent LLMs, we show that scenario-based prompting elicits broader and more strongly expressed value profiles than direct political questions, with peak interaction stages increasing the number of strongly activated value dimensions by approximately $0.75$ (out of 10 total dimensions), a statistically significant increase relative to baseline prompting ($p < 0.05$). In addition, commitment to a stance increases over the course of interaction, rising by approximately $1.4$ points on a $[0,5]$ scale from initial to decision stages. While responses become less robust to scenario paraphrasing in later interaction stages, inter-judge agreement remains relatively stable. Our results suggest that evaluating LLM political behavior requires moving beyond static prompts toward longer interactive settings that capture how values are applied in context.


2603.20884 2026-06-04 cs.CL 版本更新

回答前先置信：高效LLM不确定性估计的范式转变

Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An, Junxiang Qiu, Xiang Wang, Qi Tian

发表机构 * University of Science and Technology of China（中国科学技术大学）； Huawei Inc.（华为公司）

AI总结提出CoCA框架，通过GRPO强化学习联合优化置信度校准与答案准确性，实现回答前输出置信度，提升不确定性估计效率。

详情

AI中文摘要

大型语言模型（LLM）的可靠部署需要准确的不确定性估计。现有方法主要是先回答后置信，即在生成答案后才产生置信度，这衡量的是特定响应的正确性，限制了实际可用性。我们研究了一种置信优先范式，其中模型在回答之前输出其置信度，将该分数解释为模型在当前策略下正确回答问题的概率。我们提出了CoCA（联合优化的置信度和答案），这是一种GRPO强化学习框架，通过分段信用分配联合优化置信度校准和答案准确性。通过为置信度和答案段分配单独的奖励和组相对优势，CoCA实现了稳定的联合优化并避免了奖励黑客攻击。在数学、代码和事实问答基准上的实验表明，在保持答案质量的同时，校准和不确定性区分能力得到改善，从而支持更广泛的下游应用。

英文摘要

Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer, which measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
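
Segmented credit assignment with group-relative advantages can be sketched as below. The sketch assumes a group of rollouts for one question, each with a stated confidence and a correctness flag. The confidence reward used here (negative squared error against the group's empirical accuracy) is an assumption chosen to match the abstract's reading of confidence as the probability of answering correctly under the current policy; the paper's actual reward functions are not given in the abstract, and all names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def group_relative(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: reward standardised within the rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segmented_advantages(
    confidences: Sequence[float], correct: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Return separate advantages for the confidence segment and the answer segment."""
    accuracy = mean(correct)  # empirical success rate of the current policy on this question
    conf_rewards = [-(c - accuracy) ** 2 for c in confidences]  # assumed calibration reward
    answer_rewards = [float(x) for x in correct]                # assumed correctness reward
    return group_relative(conf_rewards), group_relative(answer_rewards)


# Four rollouts for one question: stated confidence and whether the answer was right.
confidences = [0.5, 0.9, 0.1, 0.5]
correct = [1, 0, 1, 0]
conf_adv, ans_adv = segmented_advantages(confidences, correct)
print(conf_adv)
print(ans_adv)
```

Keeping the two advantage vectors separate means that tokens in the confidence segment are credited only for calibration and tokens in the answer segment only for correctness, so an over-confident but correct rollout is not rewarded for its confidence.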


2603.03205 2026-06-04 cs.CL 版本更新

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

学习何时行动或拒绝：为安全的多步骤工具使用守护代理推理模型

Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

发表机构 * Microsoft Research（微软研究院）

AI总结提出MOSAIC框架，通过显式安全推理和基于偏好的强化学习，使代理模型在工具使用中安全决策，减少有害行为并保持良性性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

代理语言模型在安全机制上与聊天模型根本不同：它们必须规划、调用工具并执行长期行动，其中单个失误（如访问文件或输入凭据）可能导致不可逆的伤害。现有的对齐方法主要针对静态生成和任务完成进行优化，由于顺序决策、对抗性工具反馈和过度自信的中间推理，在这些设置中失效。我们引入了MOSAIC，一个后训练框架，通过使安全决策显式且可学习，对齐代理以实现安全的多步骤工具使用。MOSAIC将推理构建为规划、检查、然后行动或拒绝的循环，将显式安全推理和拒绝作为第一类行动。为了在没有轨迹级标签的情况下进行训练，我们使用基于偏好的强化学习与成对轨迹比较，这捕获了标量奖励常常忽略的安全区别。我们在三个模型家族（Qwen2.5-7B、Qwen3-4B-Thinking和Phi-4）以及跨分布基准（涵盖有害任务、提示注入、良性工具使用和跨域隐私泄露）上评估了MOSAIC的零样本性能。MOSAIC将有害行为减少高达50%，在注入攻击上将有害任务拒绝率提高超过20%，减少隐私泄露，并保持或改善良性任务性能，展示了跨模型、领域和代理设置的鲁棒泛化能力。

英文摘要

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.


2502.03799 2026-06-04 cs.CL cs.SY eess.SY 版本更新

Enhancing Hallucination Detection through Noise Injection

通过噪声注入增强幻觉检测

Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yubing Jian, Yao Qin, Roland Memisevic

发表机构 * Qualcomm AI Research（高通人工智能研究）

AI总结提出一种基于贝叶斯模型不确定性的无训练噪声注入方法，通过扰动模型参数或隐藏单元激活来改进采样过程，显著提升大语言模型幻觉检测性能。

Comments ICLR 2026 main conference paper

详情

AI中文摘要

大型语言模型（LLMs）容易生成看似合理但错误的响应，即幻觉。因此，有效检测幻觉对于LLMs的安全部署至关重要。最近的研究将幻觉与模型不确定性联系起来，表明可以通过测量从模型中抽取的多个样本的答案分布离散度来检测幻觉。虽然从模型定义的词元分布中抽取样本是一种自然的方式，但在这项工作中，我们认为这对于检测幻觉而言并非最优。我们表明，通过以贝叶斯方式考虑模型不确定性，可以显著改进检测。为此，我们提出了一种非常简单、无需训练的方法，该方法基于在采样过程中扰动适当子集的模型参数，或等效地扰动隐藏单元激活。我们证明，我们的方法在跨不同数据集、模型架构和不确定性度量上，显著优于标准采样的推理时幻觉检测。

英文摘要

Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is suboptimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
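
The detection recipe in the abstract (perturb hidden activations during sampling, then measure dispersion over the resulting answers) can be illustrated with a toy sketch. Everything concrete here is an assumption: the two-layer `toy_model`, its weights, the uniform noise distribution, greedy decoding, and entropy as the dispersion measure. It is not the authors' implementation and uses no real LLM.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def answer_entropy(answers: Sequence) -> float:
    """Shannon entropy (nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def toy_model(x: Sequence[float], noise: float, rng: random.Random) -> int:
    """Hypothetical two-layer model; uniform noise in [-noise, noise] is added to hidden units."""
    w1 = [[1.0, -0.5], [0.3, 0.8]]
    w2 = [[1.0, 0.2], [0.9, 0.4]]
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    hidden = [h + rng.uniform(-noise, noise) for h in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return max(range(len(logits)), key=logits.__getitem__)  # greedy answer index


def dispersion(x: Sequence[float], noise: float, n_samples: int = 20, seed: int = 0) -> float:
    """Draw several answers under noise injection and return their entropy."""
    rng = random.Random(seed)
    answers: List[int] = [toy_model(x, noise, rng) for _ in range(n_samples)]
    return answer_entropy(answers)


x = [1.0, 1.0]
print(dispersion(x, noise=0.0))  # no noise and greedy decoding: all answers agree
print(dispersion(x, noise=2.0))  # larger noise may flip answers near the decision boundary
```

Higher dispersion under perturbation would be read as higher model uncertainty and therefore a higher hallucination risk. In the setting the abstract describes, this noise is combined with ordinary token sampling rather than replacing it.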


2602.21103 2026-06-04 cs.CL cs.IR 版本更新

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

提示级蒸馏：一种非参数化的模型微调替代方案，用于高效推理

Sanket Badhe, Deep Shah

发表机构 * Google, Mountain View, California, USA（谷歌，美国加利福尼亚州山景城）

AI总结提出提示级蒸馏（PLD），通过从教师模型中提取推理模式并组织为结构化指令注入学生模型的系统提示，无需微调即可提升小模型推理性能，在多个基准上接近前沿水平且延迟极低。

Comments Accepted at ACL 2026 Industry Track

详情

AI中文摘要

高级推理通常需要思维链提示，虽然准确但会导致高昂的延迟和测试时推理成本。标准的替代方案——微调较小的模型——往往牺牲可解释性，同时引入显著的计算和操作开销。为了解决这些限制，我们引入了提示级蒸馏（PLD）。我们从教师模型中提取显式推理模式，并将其组织成结构化的表达性指令列表，用于学生模型的系统提示。使用Gemma-3 4B进行评估，PLD在StereoSet上将Macro F1分数从57%提升至90.0%，在Contract-NLI上从67%提升至83%，同时将LogiQA准确率提升至70%。在Mistral Small 3.1上的类似结果证明了跨架构的泛化能力，使得这些紧凑模型能够以可忽略的延迟开销达到前沿性能。这些表达性指令使决策过程透明化，允许对逻辑进行完全人工验证，使得该方法非常适合法律、金融和内容审核等受监管行业，以及高容量用例和边缘设备。

英文摘要

Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated using Gemma-3 4B, PLD improved Macro F1 scores on StereoSet (57% to 90.0%) and Contract-NLI (67% to 83%), while increasing LogiQA accuracy to 70%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.


2511.05722 2026-06-04 cs.CL cs.AI 版本更新

OckBench: Measuring the Efficiency of LLM Reasoning

OckBench：衡量LLM推理效率

Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Massachusetts Institute of Technology（麻省理工学院）； NVIDIA（英伟达）

AI总结提出OckBench基准，联合评估推理和编码任务中的准确性与token效率，揭示当前模型token利用率低下问题。

详情

AI中文摘要

大型语言模型（LLM）如GPT-5和Gemini 3已推动自动推理和代码生成的前沿。然而，当前基准强调准确性和输出质量，忽略了关键维度：token使用的效率。在实际应用中，token效率变化很大。解决相同问题且准确率相近的模型，其token长度差异可达5.0倍，导致模型推理能力存在巨大差距。这种差异暴露了显著的冗余，凸显了对标准化基准来量化token效率差距的迫切需求。因此，我们引入OckBench，这是首个联合衡量推理和编码任务中准确性与token效率的基准。我们的评估表明，当前模型的token效率在很大程度上未得到优化，显著增加了服务成本和延迟。这些发现为社区优化潜在推理能力（即token效率）提供了具体路线图。最终，我们主张评估范式转变：token不应被无谓地倍增。我们的基准可在https://ockbench.github.io/获取。

英文摘要

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. Token efficiency is highly variable in practice. Models solving the same problem with similar accuracy can exhibit up to a 5.0× difference in token length, leading to a massive gap in model reasoning ability. Such variance exposes significant redundancy, highlighting the critical need for a standardized benchmark to quantify the gap in token efficiency. Thus, we introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize the latent reasoning ability, namely token efficiency. Ultimately, we argue for an evaluation paradigm shift: tokens must not be multiplied beyond necessity. Our benchmarks are available at https://ockbench.github.io/.
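
Reporting accuracy and token usage side by side can be sketched as follows. The abstract does not define OckBench's metrics, so the `Result` record, the `summarize` helper, and the token-ratio comparison are hypothetical, and the numbers are made-up toy values rather than benchmark results.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Result:
    correct: bool       # whether the model solved the problem
    output_tokens: int  # tokens generated for that problem


def summarize(results: List[Result]) -> Tuple[float, float]:
    """Return (accuracy, mean output tokens per problem)."""
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    mean_tokens = sum(r.output_tokens for r in results) / n
    return accuracy, mean_tokens


def token_ratio(a: List[Result], b: List[Result]) -> float:
    """How many times more tokens model b uses than model a on the same problems."""
    return summarize(b)[1] / summarize(a)[1]


# Toy results for two hypothetical models on the same three problems.
model_a = [Result(True, 200), Result(True, 300), Result(False, 100)]
model_b = [Result(True, 1000), Result(False, 1200), Result(True, 800)]
print(summarize(model_a))
print(summarize(model_b))
print(token_ratio(model_a, model_b))
```

Both toy models reach the same accuracy, yet one spends five times as many tokens, which is the kind of gap an accuracy-only leaderboard would hide.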


2602.19101 2026-06-04 cs.CL cs.AI 版本更新

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

价值纠缠：（部分）大型语言模型中不同种类“善”的混淆

Seong Hah Cho, Junyi Li, Anna Leshinskaya

发表机构 * Independent（独立研究者）； Department of Cognitive Sciences, UC Irvine（加州大学尔湾分校认知科学系）

AI总结通过探测模型行为、嵌入和残差流激活，发现大型语言模型普遍存在价值纠缠，即道德、语法和经济三种价值被混淆，其中语法和经济价值过度受道德价值影响，通过选择性消融与道德相关的激活向量可修复此问题。

2506.06006 2026-06-04 cs.CV cs.AI cs.CL 版本更新

面向搜索增强型大语言模型推理的自适应信息控制

Siheng Xiong, Oguzhan Gungordu, James C. Kerce, Faramarz Fekri

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结提出基于信息效用的自适应控制框架DeepControl，通过控制检索的广度与分辨率，提升搜索增强推理的性能与训练稳定性。

详情

AI中文摘要

搜索增强型推理代理将多步推理与外部检索交错进行，但不受控制的检索可能引入冗余证据、使上下文饱和，并破坏强化学习（RL）的稳定性。现有的基于结果的RL方法仅提供稀疏的终端奖励，对中间信息获取决策的指导有限。我们提出DeepControl，一种基于信息效用的自适应信息控制框架，其中信息效用是检索证据边际价值的状态依赖估计。该框架沿两个维度调节信息获取：广度（即是否应继续检索）和分辨率（即应暴露多少检索细节）。它通过检索继续引导、层次化粒度控制以及退火控制强制方案来实现这些控制。这使得策略能够在训练期间内化有效的获取行为，并在测试时无需外部控制即可运行。在七个基准测试中，DeepControl始终优于不带显式信息控制的强RL和检索基线；与Search-R1相比，在Qwen2.5-7B和Qwen2.5-3B上分别平均提高了9.4和8.6个点。额外分析显示搜索有效性、训练稳定性和证据利用率均有所提升。

英文摘要

Search-augmented reasoning agents interleave multi-step reasoning with external retrieval, but uncontrolled retrieval can introduce redundant evidence, saturate the context, and destabilize reinforcement learning (RL). Existing outcome-based RL methods provide only sparse terminal rewards, offering limited guidance for intermediate information-acquisition decisions. We propose DeepControl, an adaptive information-control framework based on information utility, a state-dependent estimate of the marginal value of retrieved evidence. The framework regulates information acquisition along two axes: extent, i.e., whether retrieval should continue, and resolution, i.e., how much retrieved detail should be exposed. It implements these controls through retrieval-continuation guidance, hierarchical granularity control, and an annealed control-forcing scheme. This enables the policy to internalize effective acquisition behavior during training and operate without external control at test time. Across seven benchmarks, DeepControl consistently outperforms strong RL and retrieval baselines without explicit information control; compared with Search-R1, it improves average performance by +9.4 and +8.6 points on Qwen2.5-7B and Qwen2.5-3B, respectively. Additional analyses show improved search effectiveness, training stability, and evidence utilization.


2601.22396 2026-06-04 cs.CL cs.AI cs.CY cs.HC physics.soc-ph 版本更新

Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

大型语言模型中的文化基础人物角色：表征及其与社会心理价值框架的对齐

Candida M. Greco, Lucio La Cava, Andrea Tagarelli

发表机构 * DIMES, University of Calabria, Italy（意大利卡拉布里亚大学DIMES研究所）

AI总结本研究通过世界价值观调查、英格尔哈特-韦尔策尔文化地图和道德基础理论，评估大型语言模型生成的文化基础人物角色是否准确反映不同文化条件下的世界和道德价值体系，并分析其跨文化结构和道德变异。

Comments Under Review

详情

AI中文摘要

尽管大型语言模型（LLMs）在模拟人类行为方面的实用性日益增强，但这些合成人物角色在不同文化条件下是否准确反映世界和道德价值体系仍不确定。本文研究了合成、文化基础人物角色与既定框架（特别是世界价值观调查（WVS）、英格尔哈特-韦尔策尔文化地图和道德基础理论）的对齐情况。我们基于一组可解释的WVS衍生变量概念化并生成LLM人物角色，并通过三个互补视角检查生成的人物角色：在英格尔哈特-韦尔策尔地图上的定位，揭示其反映跨文化条件稳定差异的解释；与世界价值观调查在人口统计层面的一致性，其中响应分布大致追踪人类群体模式；以及源自道德基础问卷的道德轮廓，我们通过文化-道德映射分析道德响应如何在不同文化配置中变化。我们的文化基础人物角色生成和分析方法能够评估跨文化结构和道德变异。

英文摘要

Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.


2601.19921 2026-06-04 cs.CL cs.AI 版本更新

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

揭秘多智能体辩论：置信度与多样性的作用

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos

发表机构 * University of Cambridge（剑桥大学）； University of Sheffield（谢菲尔德大学）

AI总结针对多智能体辩论（MAD）在提升大语言模型性能时效果不佳的问题，提出多样性感知初始化和置信度调节辩论协议两种轻量级干预方法，显著提升辩论有效性。

详情

AI中文摘要

多智能体辩论（MAD）被广泛用于通过测试时缩放提升大语言模型（LLM）性能，然而近期研究表明，尽管计算成本更高，普通MAD往往不如简单的多数投票。已有研究表明，在同质化智能体和统一信念更新下，辩论只会保持期望正确率不变，因此无法可靠地改善结果。借鉴人类审议和集体决策的研究发现，我们识别出普通MAD缺失的两个关键机制：（i）初始观点的多样性，以及（ii）明确且校准的置信度沟通。我们提出两种轻量级干预方法。首先，一种多样性感知初始化，选择更多样化的候选答案池，增加辩论开始时存在正确假设的可能性。其次，一种置信度调节的辩论协议，其中智能体表达校准后的置信度，并根据他人的置信度调节其更新。我们从理论上证明，多样性感知初始化在不改变底层更新动态的情况下提高了MAD成功的先验概率，而置信度调节更新使辩论能够系统地向正确假设漂移。在实验上，在六个面向推理的QA基准测试中，我们的方法始终优于普通MAD和多数投票。我们的结果将人类审议与基于LLM的辩论联系起来，并表明简单、有原则的修改可以显著增强辩论效果。

英文摘要

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
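
The contrast between plain majority voting and confidence-aware debate can be sketched with a toy example. The update rule in `debate_round` (an agent adopts the answer with the highest total peer confidence if that total exceeds its own confidence) is an assumption made for illustration, as are all function names and values; the paper's actual protocol is not specified in the abstract.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Baseline: the most frequent answer wins, ignoring confidence."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers: Sequence[str], confidences: Sequence[float]) -> str:
    """Each agent's vote is weighted by its stated (assumed calibrated) confidence."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


def debate_round(answers: Sequence[str], confidences: Sequence[float]) -> List[str]:
    """Assumed update rule: switch to the peers' strongest answer if its total
    confidence exceeds the agent's own confidence; otherwise keep the answer."""
    updated = []
    for i, (own_answer, own_conf) in enumerate(zip(answers, confidences)):
        peer_totals: Dict[str, float] = defaultdict(float)
        for j, (answer, conf) in enumerate(zip(answers, confidences)):
            if j != i:
                peer_totals[answer] += conf
        best = max(peer_totals, key=peer_totals.get)
        updated.append(best if peer_totals[best] > own_conf else own_answer)
    return updated


# Two unsure agents agree on "A"; one confident agent says "B".
answers = ["A", "A", "B"]
confidences = [0.3, 0.3, 0.9]
print(majority_vote(answers))
print(confidence_weighted_vote(answers, confidences))
print(debate_round(answers, confidences))
```

Majority voting follows the two unsure agents, while the confidence-aware variants move toward the confident one. This helps only if confidence is actually calibrated, which is why the abstract stresses calibrated confidence communication.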


2506.10912 2026-06-04 cs.AI cs.CL 版本更新

Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Breaking Bad Molecules: MLLMs 是否准备好进行结构级分子解毒？

Fei Lin, Ziyang Gong, Cong Wang, Tengchao Zhang, Yonglin Tian, Yining Jiang, Ji Dai, Chao Guo, Xiaotong Yu, Xue Yang, Gen Luo, Fei-Yue Wang

发表机构 * Department of Engineering Science, Macau University of Science and Technology, Macau, China（澳门科技大学工程科学系）； School of Computer Science, Shanghai Jiao Tong University, Shanghai, China（上海交通大学计算机科学学院）； Institute of Automation, Chinese Academy of Sciences, Beijing, China（中国科学院自动化研究所）； School of Pharmacy, Macau University of Science and Technology, Macau, China（澳门科技大学药学院）； Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, China（宁波大学电气工程与计算机科学学院）； State Key Laboratory of Biopharmaceutical Preparation and Delivery, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, China（中国科学院过程工程研究所生物制药制备与递送国家重点实验室）； School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China（上海交通大学自动化与智能感知学院）； Shanghai Artificial Intelligence Laboratory, Shanghai, China（上海人工智能实验室）

AI总结本文提出 ToxiMol 基准任务，利用多模态大语言模型进行分子毒性修复，并构建数据集、提示流程和自动评估框架 ToxiEval，实验表明当前模型虽面临挑战但展现出毒性理解与结构编辑的潜力。

AI中文摘要

毒性仍然是早期药物开发失败的主要原因。尽管分子设计和性质预测取得了进展，但分子毒性修复任务——生成结构有效且毒性降低的分子替代物——尚未被系统定义或基准化。为填补这一空白，我们引入了 ToxiMol，这是首个针对通用多模态大语言模型（MLLMs）的分子毒性修复基准任务。我们构建了一个标准化数据集，涵盖 11 个主要任务和 660 个代表性有毒分子，覆盖多种机制和粒度。我们设计了一个具有机制感知和任务自适应能力的提示注释流程，并基于专家毒理学知识。同时，我们提出了一个自动评估框架 ToxiEval，将毒性终点预测、合成可及性、类药性和结构相似性集成到高通量评估链中，用于修复成功评估。我们系统评估了 43 个主流通用 MLLMs，并进行了多项消融研究，以分析关键问题，包括评估指标、候选多样性和失败归因。实验结果表明，尽管当前 MLLMs 在此任务上仍面临重大挑战，但它们开始展现出在毒性理解、语义约束遵循和结构感知编辑方面的有前景的能力。

英文摘要

Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.

2601.18777 2026-06-04 cs.LG cs.AI cs.CL cs.IR stat.AP 版本更新

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

PRECISE: 使用预测驱动的排名估计减少LLM评估的偏差

Abhishek Divekar, Anirban Majumder

发表机构 * Primary contributor and corresponding author（主要贡献者及通讯作者）

AI总结提出PRECISE框架，通过结合少量人工标注与LLM判断，利用预测驱动推断（PPI）方法，在低资源下可靠估计搜索、排序和RAG系统的指标，并校正LLM偏差。

Comments Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)

AI中文摘要

评估搜索、排序和RAG系统的质量传统上需要大量人工相关性标注。近年来，一些已部署的系统探索使用大型语言模型（LLM）作为自动评判者，但其固有偏差阻碍了直接用于指标估计。我们提出了一个扩展预测驱动推断（PPI）的统计框架，将最少的人工标注与LLM判断相结合，以生成需要子实例标注的指标的可靠估计。我们的方法仅需少至100个人工标注查询和10,000个未标注示例，相比传统方法显著减少了标注需求。我们为基于LLM的查询改写应用中的相关性提升推断制定了所提出的框架（PRECISE），将PPI扩展到查询-文档级别的子实例标注。通过重新制定指标集成空间，我们将计算复杂度从O(2^|C|)降低到O(2^K)，其中|C|表示语料库大小（百万量级）。在多个著名检索数据集上的详细实验表明，我们的方法降低了业务关键指标Precision@K的估计方差，同时在低资源设置下有效校正了LLM偏差。
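
The abstract does not spell out the PRECISE estimator, so the following is only a minimal sketch of the basic prediction-powered mean estimate that PPI builds on: the LLM judge's average score on the unlabeled pool, corrected by the judge's average error on the small human-labeled set. The function and argument names are hypothetical, and the extension to query-document sub-instance labels and Precision@K described above is not shown.

```python
from statistics import mean


def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Prediction-powered estimate of a mean metric (illustrative sketch).

    human_labels       : gold scores for the small labeled set
    judge_on_labeled   : LLM-judge scores for the same labeled items
    judge_on_unlabeled : LLM-judge scores for the large unlabeled pool
    """
    if len(human_labels) != len(judge_on_labeled):
        raise ValueError("labeled lists must be aligned item by item")
    # Rectifier: the judge's average error, measured where gold labels exist.
    rectifier = mean(h - j for h, j in zip(human_labels, judge_on_labeled))
    # Judge-only estimate on the unlabeled pool, shifted by the rectifier.
    return mean(judge_on_unlabeled) + rectifier


# Toy data: the judge marks everything relevant, humans disagree half the time.
estimate = ppi_mean(
    human_labels=[1, 0, 1, 0],
    judge_on_labeled=[1, 1, 1, 1],
    judge_on_unlabeled=[1, 1, 1, 0],
)
```

In this toy case the judge-only average is 0.75, the rectifier is -0.5, and the corrected estimate is 0.25, which shows how a small labeled set can pull a biased judge back toward the human scale.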

拓扑结构至关重要：多智能体大语言模型中的内存泄漏测量

Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yan Liu, Yue Zhao, Xiyang Hu

发表机构 * Arizona State University（亚利桑那州立大学）； University of Southern California（南加州大学）； Florida State University（佛罗里达州立大学）

AI总结提出MAMA框架，通过控制图拓扑结构评估多智能体LLM系统中的内存泄漏，发现密集连接、短攻击距离和高中心性增加泄漏，并给出稀疏或层次化拓扑的设计建议。

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. Camera-ready version

AI中文摘要

图拓扑结构是多智能体LLM系统中内存泄漏的基本决定因素，但其影响尚未得到充分量化。我们提出了MAMA（多智能体内存攻击），一个用于比较多智能体LLM系统中拓扑条件内存泄漏的受控评估框架。MAMA操作于包含标记的个人身份信息（PII）实体的合成文档，从中生成经过清理的任务指令。我们执行两阶段协议：Engram（将私人信息植入目标智能体的内存）和Resonance（多轮交互，攻击者尝试提取）。在10轮中，我们使用两阶段恢复标准测量泄漏，该标准结合了精确匹配提取和基于LLM对攻击者最终输出的推理。我们评估了六种典型拓扑（完全图、环、链、树、星、星环），涉及n∈{4,5,6}、攻击者-目标放置和基础模型。结果一致：更密集的连通性、更短的攻击者-目标距离和更高的目标中心性增加泄漏；大多数泄漏发生在早期轮次，然后趋于平稳；模型选择改变绝对比率但保留广泛的结构趋势；时空/位置属性比身份凭证或受监管标识符更容易泄漏。我们提炼出系统设计的实用指导：倾向于稀疏或层次化连通性，最大化攻击者-目标分离，并通过拓扑感知访问控制限制枢纽/捷径路径。我们的代码可在https://github.com/llll121/mama-eval获取。

英文摘要

Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a controlled evaluation framework for comparing topology-conditioned memory leakage in multi-agent LLM systems. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over 10 rounds, we measure leakage using a two-stage recovery criterion that combines exact-match extraction with LLM-based inference over the attacker's final output. We evaluate six canonical topologies (complete, circle, chain, tree, star, star-ring) across $n\in\{4,5,6\}$, attacker-target placements, and base models. Results are consistent: denser connectivity, shorter attacker-target distance, and higher target centrality increase leakage; most leakage occurs in early rounds and then plateaus; model choice shifts absolute rates but preserves broad structural trends; spatiotemporal/location attributes leak more readily than identity credentials or regulated identifiers. We distill practical guidance for system design: favor sparse or hierarchical connectivity, maximize attacker-target separation, and restrict hub/shortcut pathways via topology-aware access control. Our code is available at https://github.com/llll121/mama-eval.
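
The structural quantities named in the abstract (connectivity density and attacker-target distance) can be computed directly from a topology. The sketch below is an illustration only and is not taken from the MAMA code base: it builds four of the six topologies whose shape is unambiguous (complete, circle, chain, star), leaves out tree and star-ring because the abstract does not define them, and uses hypothetical function names.

```python
from collections import deque


def build_topology(kind, n):
    """Return an undirected graph {node: set(neighbors)} on nodes 0..n-1."""
    if kind == "complete":
        edges = {(i, j) for i in range(n) for j in range(i + 1, n)}
    elif kind == "chain":
        edges = {(i, i + 1) for i in range(n - 1)}
    elif kind == "circle":
        edges = {(i, (i + 1) % n) for i in range(n)}
    elif kind == "star":  # assumption: node 0 is the hub
        edges = {(0, i) for i in range(1, n)}
    else:
        raise ValueError(f"unknown topology: {kind}")
    graph = {i: set() for i in range(n)}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distance(graph, src, dst):
    """Shortest-path hop count between two agents via breadth-first search."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None  # dst is unreachable from src


def density(graph):
    """Fraction of possible undirected edges that are present."""
    n = len(graph)
    m = sum(len(neighbors) for neighbors in graph.values()) // 2
    return m / (n * (n - 1) / 2)


# Attacker at node 1, target at node 3, five agents.
for kind in ("complete", "circle", "chain", "star"):
    g = build_topology(kind, 5)
    print(kind, density(g), distance(g, 1, 3))
```

Under the trends reported above, the complete graph (density 1.0, distance 1) would be expected to leak more than the chain (density 0.4, distance 2), though the absolute rates depend on the base model.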

2601.05633 2026-06-04 cs.CL 版本更新

GIFT: Games as Informal Training for Generalizable LLMs

GIFT：游戏作为通用型LLM的非正式训练

Nuoyan Lyu, Bingbing Xu, Xueyun Tian, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS（人工智能安全国家重点实验室，计算技术研究所，中国科学院）； University of Chinese Academy of Sciences（中国科学院大学）； University of Washington（华盛顿大学）； National University of Singapore（新加坡国立大学）

AI总结提出将游戏作为非正式训练环境，结合协调子任务训练（CST）方法，提升LLM在抽象推理、规划、创造力等通用能力上的泛化性能。

AI中文摘要

最近的LLM在数学推理和代码生成等正式任务上表现出色，但在规划、创造力和社交智能等更广泛的能力上仍然存在困难。人类的智力由正式教学和非正式经验共同塑造，受此启发，我们将非正式学习引入LLM训练，并使用游戏作为无需标注、由反馈驱动的环境。为了涵盖抽象推理、规划、创造力和社交互动等多种能力，我们将正式数学任务与三种代表性游戏任务（矩阵游戏、井字棋和谁是卧底）相结合。然而，在统一的RL目标下直接混合这些任务可能会模糊特定任务的学习信号，并且没有为协调任务梯度方向提供明确的指导。为了解决这些问题，我们提出了协调子任务训练（CST），它用顺序的子任务特定更新替换单一的混合更新，分离异质RL信号，同时隐式促进子任务间的协调。在能力导向基准上的实验表明，基于游戏的非正式学习带来的泛化能力超过仅使用正式训练，而CST通过保持领域内子任务性能并提高更广泛的通用能力，进一步增强了多任务RL。代码和数据已公开。

英文摘要

Recent LLMs excel at formal tasks such as mathematical reasoning and code generation, but still struggle with broader abilities such as planning, creativity, and social intelligence. Inspired by human learning, where formal instruction and informal experience jointly shape intelligence, we introduce informal learning into LLM training and use games as annotation-free, feedback-driven environments. To cover diverse abilities including abstract reasoning, planning, creativity, and social interaction, we combine formal math tasks with three representative game tasks, including Matrix Games, TicTacToe, and Who's the Spy. However, directly mixing these tasks under a unified RL objective can blur task-specific learning signals and provides no explicit guidance for coordinating task-gradient directions. To combat these, we propose Coordinated Subtask Training (CST), which replaces a single mixed update with sequential subtask-specific updates, separating heterogeneous RL signals while implicitly promoting coordination among subtasks. Experiments on ability-oriented benchmarks show that game-based informal learning improves generalization beyond formal training alone, while CST further enhances multi-task RL by preserving in-domain subtask performance and improving broader general abilities. Code and data are publicly available.

2511.07107 2026-06-04 cs.AI cs.CL 版本更新

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

MENTOR: 一种元认知驱动的自我进化框架，用于发现和缓解大语言模型中的隐式领域风险

Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Qingshan Liu, Guangze Ye, Guoqing Wang, Jie Zhou, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University（华东师范大学计算机科学与技术学院）； Shanghai AI Lab, Shanghai Innovation Institute（上海人工智能实验室，上海创新研究院）

AI总结针对大语言模型在特定领域（如教育、金融、管理）中存在的隐式安全风险，提出基于元认知自我评估和动态规则知识图谱的MENTOR框架，通过激活级引导信号有效降低攻击成功率。

AI中文摘要

确保大语言模型（LLMs）的安全性对于实际部署至关重要。然而，当前的安全措施往往无法解决隐式的、特定领域的风险。为了研究这一差距，我们引入了一个包含3000个标注查询的数据集，涵盖教育、金融和管理领域。对14个主流LLMs的评估揭示了一个令人担忧的漏洞：平均越狱成功率为57.8%。为此，我们提出了MENTOR，一种元认知驱动的自我进化框架。MENTOR执行元认知自我评估，采用视角转换和后果推理等策略来揭示潜在的模型错位。由此产生的反思被提炼为动态的基于规则的知识图谱，从中检索到的规则被转换为激活级引导信号，以在推理过程中指导内部表示。实验表明，MENTOR在所有测试领域显著降低了攻击成功率，并优于现有的安全对齐方法。MENTOR的代码和数据集可在 https://anonymous.4open.science/r/MENTOR-Evo 获取。

英文摘要

Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR performs metacognitive self-assessment, using strategies such as perspective-taking and consequential reasoning to uncover latent model misalignments. The resulting reflections are distilled into dynamic rule-based knowledge graphs, from which retrieved rules are converted into activation-level steering signals to guide internal representations during inference. Experiments demonstrate that MENTOR substantially reduces attack success rates across all tested domains and outperforms existing safety alignment methods. The code and dataset for MENTOR are available at: https://anonymous.4open.science/r/MENTOR-Evo.

2411.05894 2026-06-04 cs.CL cs.AI cs.LG 版本更新

SSSD: Simply-Scalable Speculative Decoding

SSSD: 简单可扩展的推测解码

Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli

发表机构 * Huawei（华为）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出一种无需训练的推测解码方法SSSD，结合轻量级n-gram匹配和硬件感知推测，在多种基准测试中达到与领先训练方法相当的性能，延迟降低高达2.9倍，且对语言和领域变化具有鲁棒性。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main Conference)

AI中文摘要

推测解码已成为加速大型语言模型推理的流行技术。然而，大多数现有方法在生产服务系统中仅带来适度的改进。实现显著加速的方法通常依赖于额外训练的草稿模型或辅助模型组件，增加了部署和维护的复杂性。这种增加的复杂性降低了灵活性，特别是当服务负载转移到草稿模型训练数据中未充分覆盖的任务、领域或语言时。我们引入了简单可扩展的推测解码（SSSD），一种无需训练的方法，结合了轻量级n-gram匹配和硬件感知推测。相对于标准自回归解码，SSSD将延迟降低高达2.9倍。它在广泛的基准测试中达到了与领先的基于训练的方法相当的性能，同时采用成本显著更低（无需数据准备、训练或调优），并且在语言和领域变化以及长上下文设置中表现出更强的鲁棒性。

英文摘要

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort (no data preparation, training, or tuning is needed) and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
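
To make "lightweight n-gram matching" concrete, here is a minimal, hypothetical sketch of training-free drafting: the current suffix of the context is looked up earlier in the same context, and the tokens that followed that earlier occurrence are proposed as the draft. This is a generic illustration and not SSSD's implementation; the function names, the `max_ngram` and `draft_len` parameters, and the greedy acceptance rule are all assumptions, and the hardware-aware part of the method is not modeled.

```python
def propose_draft(tokens, max_ngram=3, draft_len=4):
    """Propose draft tokens by matching the current suffix earlier in the context.

    Tries the longest suffix first and returns [] when nothing matches.
    """
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + draft_len]
                if continuation:
                    return continuation
    return []


def accept_prefix(draft, target_tokens):
    """Keep the longest draft prefix that agrees with the target model's own tokens.

    target_tokens stands in for what the target model would emit at each
    draft position when it verifies the whole draft in one forward pass.
    """
    accepted = []
    for drafted, verified in zip(draft, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted


context = "the cat sat on the mat . the cat".split()
draft = propose_draft(context)
kept = accept_prefix(draft, ["sat", "on", "a", "rug"])
```

Here the suffix `the cat` matches the start of the context, so the draft is `sat on the mat`, and the verification step keeps `sat on`. Because the draft comes from a lookup rather than a second model, nothing has to be trained or redeployed when the workload changes.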

2507.16199 2026-06-04 cs.CL 版本更新

LLM Abstention Can Be a Prompt Artifact, in Addition to Genuine Uncertainty

LLM 的拒绝回答可能既是真实不确定性的体现，也是提示的产物

Zipeng Ling, Shuliang Liu, Yuehao Tang, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； University of Pennsylvania（宾夕法尼亚大学）； Huazhong University of Science and Technology（华中科技大学）； Nanjing University of Posts and Telecommunications（南京邮电大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结本文发现大语言模型（LLM）的拒绝回答行为不仅源于真实不确定性，还受提示结构影响，称为“拒绝膨胀”，并通过实验证明该现象由额外选项的结构性存在触发，而非真实不确定性。

AI中文摘要

大型语言模型（LLM）越来越多地被训练来拒绝回答它们不确定的问题。然而，这种能力经常被误用：在实际应用中，输入提示有时包含不确定性元素，受此驱动，LLM 倾向于拒绝回答它们本可以解决的问题。我们认为 LLM 的拒绝回答不仅是真实不确定性的表达；它也是一种很大程度上受提示影响的产物。我们将这种现象命名为 *拒绝膨胀*。我们为 LLM 添加“未知”作为额外选项供其选择；实验表明，在真/假问题（TFQ）上准确率严重下降。将“未知”替换为不相关的随机词会产生相同的效果。我们认为 LLM 被训练成模仿 *拒绝回答* 的表面模式，而不是表达真实的不确定性。基于十个实验，我们支持四个主张，它们构成了一个递进的论证：（C1）*拒绝膨胀* 是由额外选项的结构性存在触发的，而不是由真实不确定性触发的；（C2）进一步，它使模型在能够回答时也否认自己能回答；（C3）在表示层面，这表现为后层输出覆盖；（C4）最后，这种偏差是稳定的，并通过指令调优出现，而非随机噪声。

英文摘要

Large Language Models (LLMs) are increasingly trained to abstain from answering questions they are unsure about. However, this ability is often misused: in real-world applications, input prompts sometimes contain uncertainty elements, and driven by this, LLMs are inclined to abstain even on problems they are capable of solving. We argue that LLM abstention is not only an expression of genuine uncertainty; it is also an artifact that can be largely influenced by prompts. We name this phenomenon *Abstention Inflation*. We add "Unknown" as an extra option for LLMs to choose from; experiments show serious accuracy drops on True/False Questions (TFQs). Replacing "Unknown" with an unrelated random word produces an identical effect. We argue that LLMs are trained to imitate the surface pattern of *abstention*, rather than to express genuine uncertainty. Based on ten experiments, we support four claims that form a progressive argument: **(C1)** *Abstention Inflation* is triggered by the structural presence of an extra option, not by genuine uncertainty; **(C2)** further, it makes the model deny it can answer even when it can; **(C3)** at the representation level, this manifests as a later-layer output override; **(C4)** finally, this bias is stable and emerges through instruction tuning, rather than stochastic noise.

2512.15552 2026-06-04 cs.CL 版本更新

Automated Lexical Coverage for Language Learning: From General to Specialized Word Lists

语言学习的自动化词汇覆盖：从通用到专业词表

Dakota Ellis, Samy Babikerali, Wanshan Chen, Bao Dinh, Uyen Le

发表机构 * University of North Carolina at Charlotte（北卡罗来纳大学夏洛特分校）； School of Data Science（数据科学学院）

AI总结本文提出一种基于目标文本自动生成专业词表的方法，相比通用词表能以更少词汇达到95%的文本覆盖率，并实现自动化、可扩展的词汇学习资源构建。

AI中文摘要

通用服务词表（GSL）是语言学习者识别重要英语单词的常用资源。传统的GSL编制依赖语言学专业知识和主观判断，资源消耗大。我们创建了自己的GSL，并对照新通用服务词表（NGSL）评估其性能。我们发现，针对特定文本定制专业词表（SWL）对语言学习者而言是一种实用方法。由于SWL源自目标文本本身，其构造方式保证它能达到语言理解所需的95%覆盖率，并且与应用于同一文本的通用词表相比，所用词汇量显著更少：在涵盖小说、学术论文和剧本的九个文本中，NGSL覆盖了每个文本的64-85%，而文本特定词表以小得多的词汇量达到95%。由于SWL流程仅依赖客观标准，它可以实现自动化和规模化，并针对全球语言学习者的需求进行定制。

英文摘要

A General Service List (GSL) is a commonly used resource for language learners to identify important English words. Traditional GSL creation is resource-intensive, relying on linguistic expertise and subjective input. We created our own GSL and evaluated its performance against the New General Service List (NGSL). We found that creating a Specialized Word List (SWL), tailored to a specific text, is a practical method for language learners. Because an SWL is derived from the target text itself, it reaches the 95% coverage required for language comprehension by construction, and it does so with substantially fewer words than a general list applied to the same text: across nine texts spanning fiction, academic papers, and scripts, the NGSL covered 64-85% of each text, whereas a text-specific list reached 95% with far smaller vocabularies. Because the SWL process relies only on objective criteria, it can be automated, scaled, and tailored to the needs of language learners across the globe.
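
The coverage figures above can be reproduced in spirit with a few lines of code. The sketch below is an assumption-laden illustration, not the authors' pipeline: it tokenizes with a simple regular expression, measures coverage as the share of running tokens found in a word list, and builds a text-specific list by adding words in descending frequency until the target coverage is reached. All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for real preprocessing."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(word_list, tokens):
    """Fraction of running tokens whose word appears in word_list."""
    known = set(word_list)
    return sum(token in known for token in tokens) / len(tokens)


def specialized_word_list(tokens, target=0.95):
    """Shortest frequency-ranked word list whose coverage reaches the target."""
    counts = Counter(tokens)
    chosen, covered = [], 0
    for word, freq in counts.most_common():
        if covered >= target * len(tokens):
            break
        chosen.append(word)
        covered += freq
    return chosen


tokens = tokenize("a a a a a a b b b c")
swl = specialized_word_list(tokens, target=0.9)
```

On this toy text the list `["a", "b"]` already covers 90% of the tokens, whereas a general list that happened to contain only `a` would cover 60%. Taking words in frequency order is what keeps the list as short as possible for a given coverage target.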

2512.08094 2026-06-04 cs.CL 版本更新

Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

分割、嵌入和对齐：将字幕与手语对齐的通用方法

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

发表机构 * VGG, Dept. of Engineering Science, University of Oxford（牛津大学工程科学系视觉几何组）； University of Zurich（苏黎世大学）； KAIST（韩国科学技术院）； LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris（LIGM，法国国家科学研究中心，古斯塔夫·埃菲尔大学，国立路桥学校，巴黎理工学院）

AI总结提出一种通用框架SEA，利用预训练模型分割视频帧序列为单个手势、嵌入手势片段到与文本共享的潜在空间，并通过轻量动态规划实现高效对齐，在多个手语数据集上达到最先进性能。

Comments Camera-ready version of ACL 2026 (Main)

AI中文摘要

本文的目标是开发一种通用方法，用于将字幕（即带有对应时间戳的口语文本）与连续手语视频对齐。先前的方法通常依赖于针对特定语言或数据集的端到端训练，这限制了它们的通用性。相比之下，我们的方法Segment, Embed, and Align (SEA)提供了一个适用于多种语言和领域的单一框架。SEA利用两个预训练模型：第一个模型将视频帧序列分割为单个手势，第二个模型将每个手势的视频片段嵌入到与文本共享的潜在空间中。随后，通过轻量级动态规划程序进行对齐，该程序即使在长达一小时的视频中也能在CPU上高效运行，耗时不到一分钟。SEA灵活且能适应各种场景，利用从小型词汇表到大型连续语料库的资源。在四个手语数据集上的实验展示了最先进的对齐性能，突显了SEA在生成高质量并行数据以推动手语处理方面的潜力。SEA的代码和模型已公开提供。

英文摘要

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
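
The abstract says only that alignment uses "a lightweight dynamic programming procedure", so the exact recurrence is not known from this text. As one plausible illustration, the sketch below assumes a precomputed similarity matrix between subtitles and sign segments (for example, cosine similarities in the shared latent space) and finds the monotonic assignment of segments to subtitles with the highest total similarity. The constraints chosen here (every segment belongs to exactly one subtitle, every subtitle gets at least one segment) and the function name are assumptions.

```python
def align(sim):
    """Monotonic alignment of sign segments to subtitles (illustrative sketch).

    sim[i][j] is the similarity between subtitle i and sign segment j.
    Returns a list giving, for each segment, the index of its subtitle.
    """
    n_sub, n_seg = len(sim), len(sim[0])
    if n_seg < n_sub:
        raise ValueError("need at least one segment per subtitle")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_seg for _ in range(n_sub)]
    came_from_prev = [[False] * n_seg for _ in range(n_sub)]
    best[0][0] = sim[0][0]
    for j in range(1, n_seg):
        for i in range(min(j + 1, n_sub)):
            stay = best[i][j - 1]  # segment j continues subtitle i
            advance = best[i - 1][j - 1] if i > 0 else neg_inf  # segment j opens subtitle i
            came_from_prev[i][j] = advance > stay
            best[i][j] = sim[i][j] + max(stay, advance)
    # Backtrack from the last subtitle at the last segment.
    assignment, i = [], n_sub - 1
    for j in range(n_seg - 1, -1, -1):
        assignment.append(i)
        if came_from_prev[i][j]:
            i -= 1
    return assignment[::-1]


similarity = [
    [0.9, 0.8, 0.1, 0.0],  # subtitle 0 vs. segments 0..3
    [0.1, 0.2, 0.9, 0.7],  # subtitle 1 vs. segments 0..3
]
assignment = align(similarity)
```

The table has one cell per subtitle-segment pair, so the cost grows with the product of the two counts, which is consistent with the claim that alignment stays cheap on CPUs. Subtitle time spans would then be read off from the first and last segment assigned to each subtitle.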

2511.12784 2026-06-04 cs.CL cs.LO 版本更新

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

通过语义相似改写评估自动形式化鲁棒性

Hayden Moore, Asfahan Shah

发表机构 * Department of Computer Science and Engineering, The Pennsylvania State University（宾夕法尼亚州立大学计算机科学与工程系）

AI总结本文通过语义相似改写生成自然语言变体，评估大语言模型在自动形式化中生成形式证明的鲁棒性，发现自然语言表述的微小变化会显著影响模型输出。

AI中文摘要

大语言模型（LLMs）最近成为自动形式化的强大工具。尽管性能令人印象深刻，这些模型仍可能难以产生扎实且可验证的形式化。文本到SQL领域的最新工作表明，即使保留高度的语义保真度，LLMs对改写的自然语言（NL）输入也可能敏感。在本文中，我们在自动形式化领域调查了这一说法。具体而言，我们通过测量语义和编译有效性，评估LLMs在生成具有语义相似改写NL语句的形式证明时的鲁棒性。使用形式基准MiniF2F和Lean 4版本的ProofNet，以及两个现代LLMs，我们生成改写的自然语言语句，并在两个模型上交叉评估这些语句。本文的结果揭示了改写输入之间的性能变异性，表明NL语句的微小变化会显著影响模型输出。

英文摘要

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved. In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F and the Lean 4 version of ProofNet, and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

2511.01192 2026-06-04 cs.CL 版本更新

DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection

DEER: 面向可泛化机器生成文本检测的实例自适应路由解耦专家混合模型

Guoxin Ma, Xiaoming Liu, Hongyang Chen, Chengzhengxu Li, Zhaohan Zhang, Shengchao Liu, Yu Lan, Cong Wang, Chao Shen

发表机构 * Faculty of Electronic and Information Engineering, Xi’an Jiaotong University（电子与信息工程学院，西安交通大学）； China Mobile Group（中国移动集团）； Queen Mary University of London（伦敦大学玛丽女王学院）； City University of Hong Kong（香港城市大学）

AI总结提出DEER框架，通过解耦领域局部与领域不变知识为专门专家模块，并利用强化学习驱动的路由器基于实例级检测奖励选择专家路径，解决机器生成文本检测中的领域偏移问题，在域内和域外数据集上均取得优于现有方法的性能。

Comments ARR Under Review

AI中文摘要

随着LLM的快速发展，检测机器生成文本已成为一项关键挑战，但现有检测器在领域偏移下性能严重下降。通过系统性的初步研究，我们将这一脆弱性归因于当前泛化策略中的两个根本缺陷：即多领域训练中领域特定知识的不完全保留，以及推理时知识检索与检测目标之间的错位。为解决这些问题，我们提出了DEER，一种解耦的专家混合框架，将领域局部和领域不变知识明确解耦到专门的专家模块中。与静态领域匹配不同，DEER采用强化学习驱动的路由器，基于实例级检测奖励选择专家路径。这种任务对齐、领域无关的机制通过优先考虑检测效用而非风格相似性，确保了对未见分布的鲁棒适应。大量实验表明，DEER始终优于最先进的检测器，在域内和域外数据集上平均F1提升1.28%和2.92%，准确率提升1.35%和2.26%，为开放世界部署提供了可靠的泛化能力。

英文摘要

Detecting machine-generated text has become a critical challenge amid the rapid advancement of LLMs, yet existing detectors degrade severely under domain shift. Through systematic pilot studies, we trace this vulnerability to two fundamental flaws in current generalization strategies, namely the incomplete preservation of domain-specific knowledge during multi-domain training and the misalignment between knowledge retrieval and the detection objective at inference. To address these gaps, we propose DEER, a Disentangled mixturE-of-ExpeRts framework that explicitly decouples domain-local and domain-invariant knowledge into specialized expert modules. Instead of static domain matching, DEER employs a reinforcement learning-driven router that selects expert pathways based on instance-level detection rewards. This task-aligned, domain-agnostic mechanism ensures robust adaptation to unseen distributions by prioritizing detection utility over stylistic resemblance. Extensive experiments demonstrate that DEER consistently outperforms state-of-the-art detectors, achieving average F1 improvements of 1.28% and 2.92%, and accuracy gains of 1.35% and 2.26% on in-domain and out-of-domain datasets, offering reliable generalization for open-world deployment.

2510.13796 2026-06-04 cs.CL cs.CV 版本更新

约束自适应拒绝采样

Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出约束自适应拒绝采样（CARS），通过自适应剪枝无效前缀来提高拒绝采样的样本效率，同时保持无分布扭曲，在程序模糊测试和分子生成等任务中优于现有方法。

AI中文摘要

语言模型（LMs）越来越多地应用于生成的输出必须满足严格语义或语法约束的场景。现有的约束生成方法处于一个谱系中：贪婪约束解码（GCD）方法在解码过程中强制执行有效性，但扭曲了LM的分布；而拒绝采样（RS）保留了保真度，但通过丢弃无效输出浪费计算资源。在程序模糊测试等领域，样本的有效性和多样性都至关重要，这两种极端方法都有问题。我们提出约束自适应拒绝采样（CARS），一种严格提高RS样本效率且不产生分布扭曲的方法。CARS从无约束LM采样开始，通过将违反约束的续写记录在trie中并从后续抽取中减去其概率质量，自适应地排除它们。这种自适应剪枝确保已证明无效的前缀不会被重新访问，接受率单调提高，并且生成的样本精确遵循约束分布。在多个领域的实验（例如程序模糊测试和分子生成）中，CARS始终实现更高的效率（以每个有效样本的LM前向传递次数衡量），同时产生比GCD以及近似LM分布的方法更强的样本多样性。

英文摘要

Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding (GCD) methods enforce validity during decoding but distort the LM's distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains (e.g., program fuzzing and molecular generation), CARS consistently achieves higher efficiency, measured in the number of LM forward passes per valid sample, while also producing stronger sample diversity than both GCD and methods that approximate the LM's distribution.
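
The trie-and-subtract idea can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes fixed-length sequences, a toy "LM" given as a function from a prefix to next-token probabilities, and a validity check that only runs on complete sequences. The class name, the dictionary-based trie, and the toy constraint are all hypothetical. Each rejected sequence is recorded, its probability mass is propagated up to every prefix, and later draws reweight each token by the mass not yet proven invalid beneath it.

```python
import random


class AdaptiveRejectionSampler:
    """Toy adaptive rejection sampler over fixed-length token sequences."""

    def __init__(self, next_token_probs, is_valid, length, seed=0):
        self.probs = next_token_probs  # prefix tuple -> {token: probability}
        self.is_valid = is_valid       # complete sequence tuple -> bool
        self.length = length
        self.rng = random.Random(seed)
        self.invalid_mass = {}         # trie as dict: prefix -> known-invalid conditional mass

    def _draw(self):
        prefix = ()
        while len(prefix) < self.length:
            dist = self.probs(prefix)
            tokens = list(dist)
            # Drop the mass already proven invalid below each continuation.
            weights = [
                dist[t] * max(0.0, 1.0 - self.invalid_mass.get(prefix + (t,), 0.0))
                for t in tokens
            ]
            prefix += (self.rng.choices(tokens, weights=weights)[0],)
        return prefix

    def _record_invalid(self, seq):
        self.invalid_mass[seq] = 1.0
        # Propagate upward: a prefix's invalid mass is the LM-weighted sum over its children.
        # (This re-queries the toy LM; a real system would cache the distributions.)
        for cut in range(self.length - 1, -1, -1):
            prefix = seq[:cut]
            self.invalid_mass[prefix] = sum(
                p * self.invalid_mass.get(prefix + (t,), 0.0)
                for t, p in self.probs(prefix).items()
            )

    def sample(self):
        """Return (valid_sequence, list_of_sequences_rejected_along_the_way)."""
        rejected = []
        while True:
            seq = self._draw()
            if self.is_valid(seq):
                return seq, rejected
            rejected.append(seq)
            self._record_invalid(seq)


def uniform_lm(prefix):
    return {"a": 0.5, "b": 0.5}


def no_double_b(seq):
    return "bb" not in "".join(seq)


sampler = AdaptiveRejectionSampler(uniform_lm, no_double_b, length=3, seed=0)
results = [sampler.sample() for _ in range(50)]
```

With this toy constraint only three of the eight sequences are invalid, so across all 50 calls the sampler can reject at most three times, each time on a different sequence. Because each step's weights are the LM probabilities scaled by the surviving mass, the product over a sequence telescopes to the LM probability divided by a constant, so the proposal remains the LM restricted to sequences not yet ruled out.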

2509.21597 2026-06-04 eess.AS cs.CL cs.SD 版本更新

AUDDT: A Unified Benchmark Toolkit for Audio and Speech Deepfake Detectors

AUDDT：音频与语音深度伪造检测器的统一基准工具包

Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk

发表机构 * MuSAELab（MuSAELab实验室）

AI总结本文提出AUDDT开源基准工具包，通过整合31个数据集并自动化评估预训练检测器，系统分析了深度伪造检测在不同操作类型和录音条件下的泛化能力与性能差异。

AI中文摘要

随着人工智能生成内容（如音频深度伪造）的普及，近期大量工作聚焦于开发深度伪造检测技术。然而，现有基准仅使用少量数据集，使得检测器在真实世界条件下的泛化能力不确定。本文系统回顾了31个现有音频深度伪造数据集，并提出了一个名为AUDDT（https://github.com/MuSAELab/AUDDT）的开源基准测试工具包。该工具包旨在自动化评估预训练检测器在广泛语音和非语音音频数据集上的性能，为用户提供其深度伪造检测器在不同操作类型和录音条件下的优缺点直接反馈。我们首先展示了所开发工具包的使用方法、基准的组成以及不同深度伪造子组的细分。接着，我们强调了AUDDT与现有基准工作的不同之处，即通过大规模、多样化的现代欺骗方法评估以及通过全面的元数据注释进行更丰富的属性级分析。使用一个广泛采用的预训练深度伪造检测器，我们展示了域内和域外检测结果，揭示了在不同条件和音频操作类型下显著的性能差异。最后，我们还分析了这些现有数据集的局限性及其与实际部署场景之间的差距。

英文摘要

With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, existing benchmarks employ a narrow set of datasets, leaving detector generalization to real-world conditions uncertain. In this paper, we systematically review 31 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across a wide range of speech and non-speech audio datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors under diverse manipulation types and recording conditions. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, we highlight how AUDDT differs from existing benchmarking efforts by enabling large-scale, diverse evaluation across modern spoofing methods and richer attribute-level analysis through comprehensive metadata annotation. Using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable performance variability across different conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gaps relative to practical deployment scenarios.

2509.15676 2026-06-04 cs.LG cs.AI cs.CL 版本更新

KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning

KITE: 基于核方法和信息论的上下文学习示例选择

Vaibhav Singh, Soumya Suvra Ghosal, Kapu Nirmal Joshua, Soumyabrata Pal, Sayak Ray Chowdhury

发表机构 * IIT Bombay（印度理工学院孟买分校）； UMD College Park（马里兰大学帕克分校）； IIT Kanpur（印度理工学院坎普尔分校）； Adobe Research（Adobe 研究院）

AI总结针对上下文学习中的示例选择问题，提出一种基于信息论和核方法的贪心算法，通过最小化查询特定预测误差并引入多样性正则化，显著提升分类性能。

AI中文摘要

上下文学习（ICL）已成为一种强大的范式，通过仅使用提示中精心选择的少量任务特定示例，使大型语言模型（LLM）适应新的、数据稀缺的任务。然而，鉴于LLM有限的上下文大小，一个基本问题出现了：应选择哪些示例以最大化给定用户查询的性能？虽然基于最近邻的方法（如KATE）已被广泛用于此目的，但它们在高维嵌入空间中存在众所周知的缺点，包括泛化能力差和缺乏多样性。在这项工作中，我们从原则性的、信息论驱动的角度研究ICL中的示例选择问题。我们首先将LLM建模为输入嵌入上的线性函数，并将示例选择任务框架化为一个查询特定的优化问题：从较大的示例库中选择一个子集，以最小化特定查询上的预测误差。这种表述通过针对特定查询实例的准确预测，偏离了传统的以泛化为中心的学习理论方法。我们推导出一个原则性的代理目标，该目标是近似子模的，从而能够使用具有近似保证的贪心算法。我们通过（i）引入核技巧以在高维特征空间中操作而无需显式映射，以及（ii）引入基于最优设计的正则化项以鼓励所选示例的多样性，进一步增强了我们的方法。实验上，我们在多个分类任务上展示了相对于标准检索方法的显著改进，突出了在真实世界、标签稀缺场景中，结构感知、多样化的示例选择对ICL的益处。

英文摘要

In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using only a few carefully selected task-specific examples presented in the prompt. However, given the limited context size of LLMs, a fundamental question arises: Which examples should be selected to maximize performance on a given user query? While nearest-neighbor-based methods like KATE have been widely adopted for this purpose, they suffer from well-known drawbacks in high-dimensional embedding spaces, including poor generalization and a lack of diversity. In this work, we study this problem of example selection in ICL from a principled, information theory-driven perspective. We first model an LLM as a linear function over input embeddings and frame the example selection task as a query-specific optimization problem: selecting a subset of exemplars from a larger example bank that minimizes the prediction error on a specific query. This formulation departs from traditional generalization-focused learning theoretic approaches by targeting accurate prediction for a specific query instance. We derive a principled surrogate objective that is approximately submodular, enabling the use of a greedy algorithm with an approximation guarantee. We further enhance our method by (i) incorporating the kernel trick to operate in high-dimensional feature spaces without explicit mappings, and (ii) introducing an optimal design-based regularizer to encourage diversity in the selected examples. Empirically, we demonstrate significant improvements over standard retrieval methods across a suite of classification tasks, highlighting the benefits of structure-aware, diverse example selection for ICL in real-world, label-scarce scenarios.
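
The abstract gives neither the surrogate objective nor the optimal design-based regularizer, so the code below is a stand-in rather than KITE itself. It illustrates the general flavor of query-specific, kernelized greedy selection: exemplars are picked one at a time to shrink the predictive variance at the query under a Gaussian-process-style model with an RBF kernel. The kernel choice, the `noise` and `gamma` parameters, and the function names are all assumptions.

```python
import math


def rbf(a, b, gamma=0.5):
    """RBF kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_select(bank, query, k, noise=0.01):
    """Greedily pick k bank indices that most reduce predictive variance at the query."""
    points = list(bank) + [query]
    q = len(points) - 1
    size = len(points)
    # Residual kernel: starts as the prior kernel and shrinks as exemplars are conditioned on.
    K = [[rbf(a, b) for b in points] for a in points]
    chosen = []
    for _ in range(min(k, len(bank))):
        candidates = [i for i in range(len(bank)) if i not in chosen]
        # Variance reduction at the query from observing candidate i with noise.
        best = max(candidates, key=lambda i: K[q][i] ** 2 / (K[i][i] + noise))
        chosen.append(best)
        column = [K[a][best] for a in range(size)]
        denom = K[best][best] + noise
        # Rank-one update of the residual kernel (sequential conditioning).
        K = [[K[a][b] - column[a] * column[b] / denom for b in range(size)]
             for a in range(size)]
    return chosen


# Embeddings are 1-D here for readability; indices 0 and 1 are exact duplicates.
bank = [(0.9,), (0.9,), (1.2,), (5.0,)]
picked = greedy_select(bank, query=(1.0,), k=2)
```

A plain nearest-neighbor retriever would return the two duplicates at 0.9. The greedy rule picks one of them and then prefers the exemplar at 1.2, because conditioning on the first pick removes most of what its duplicate could still contribute. That is the kind of diversity effect the abstract attributes to structure-aware selection.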

2508.01815 2026-06-04 cs.CL cs.AI 版本更新

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs

从图检索到模式实现：面向异构知识图谱的文本到SPARQL的反事实验证

Chengxiao Dai, Yue Xiu, Dusit Niyato

发表机构 * University of Bristol（布里斯托大学）

AI总结提出SchemaForge框架，通过问题条件化的模式切片对齐和反事实验证，在异构知识图谱上提升文本到SPARQL查询生成的执行准确率。

AI中文摘要

文本到SPARQL将自然语言问题映射为RDF知识图谱上的可执行SPARQL查询。标准评估通常预先固定目标图，但实际知识图谱问答（KGQA）可能涉及具有不同模式、部分对齐和不完整元数据的异构图集合。在此设置下，查询生成不仅依赖于SPARQL语法：系统必须识别能够支持问题所需的谓词、实体类型、连接、过滤器和约束的图模式。我们提出SchemaForge，一个面向异构KG集合的文本到SPARQL的基于模式的智能体框架。其核心机制是问题条件化的模式切片对齐：弱图证据首先识别可能的图，而更强的模式证据确定局部模式切片能否实现预期查询。选定的模式切片随后在执行前约束查询生成和验证。当仅有一个图可用时，该公式简化为带有模式基础的标准单KG文本到SPARQL。我们在LC-QuAD 2.0、QALD-9 Plus、QALD-10和Spider4SPARQL上评估SchemaForge。在四个公开基准上，SchemaForge相比最强匹配的智能体基线平均提高执行准确率11.50个百分点。在Spider4SPARQL上，SchemaForge将执行准确率从54.86%提升至64.18%，并达到73.0%的Top-1和97.0%的Top-3图分配准确率。这些结果表明，从弱图证据转向模式特定的查询承诺，结合反事实答案集检查，改进了异构知识图谱上的可执行查询生成。

英文摘要

Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix the target graph in advance, practical knowledge graph question answering (KGQA) may involve heterogeneous graph collections with different schemas, partial alignments, and incomplete metadata. In this setting, query generation depends on more than SPARQL syntax: the system must identify a graph schema that can support the predicates, entity types, joins, filters, and constraints required by the question. We present SchemaForge, a schema-grounded agentic framework for text-to-SPARQL over heterogeneous KG collections. Its central mechanism is question-conditioned schema-slice alignment: weak graph evidence first identifies plausible graphs, while stronger schema evidence determines whether a local schema slice can realize the intended query. The selected schema slice then constrains query generation and verification before execution. When only one graph is available, the same formulation reduces to standard single-KG text-to-SPARQL with schema grounding. We evaluate SchemaForge on LC-QuAD 2.0, QALD-9 Plus, QALD-10, and Spider4SPARQL. Across the four public benchmarks, SchemaForge improves execution accuracy over the strongest matched agent baseline by 11.50 percentage points on average. On Spider4SPARQL, SchemaForge improves execution accuracy from 54.86% to 64.18% and achieves 73.0% Top-1 and 97.0% Top-3 graph allocation accuracy. These results show that moving from weak graph evidence to schema-specific query commitments, together with counterfactual answer-set checks, improves executable query generation over heterogeneous knowledge graphs.

2507.21892 2026-06-04 cs.CL 版本更新

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

Graph-R1：通过端到端强化学习实现智能图RAG框架

Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan

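Affiliations: University of Science and Technology of China

AI summary: Proposes Graph-R1, the first agentic GraphRAG framework trained via end-to-end reinforcement learning. It uses lightweight knowledge hypergraph construction, multi-turn agent-environment interaction for retrieval, and an end-to-end reward mechanism, and outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.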

Comments Accepted by ICML 2026 main conference

详情

Journal ref: ICML 2026

AI中文摘要

检索增强生成（RAG）通过引入外部知识减轻大语言模型中的幻觉，但依赖于缺乏结构语义的基于块的检索。图RAG方法通过将知识建模为实体-关系图来改进RAG，但仍面临构建成本高、固定一次性检索以及依赖长上下文推理和提示设计等挑战。为解决这些问题，我们提出Graph-R1，首个通过端到端强化学习（RL）的智能图RAG框架。它引入了轻量知识超图构建，将检索建模为多轮智能体-环境交互，并通过端到端奖励机制优化智能体过程。在标准RAG数据集上的实验表明，Graph-R1在推理准确性、检索效率和生成质量上优于传统图RAG和强化学习增强的RAG方法。我们的软件和数据公开在https://github.com/LHRLAB/Graph-R1。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, the first agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality. Our software and data are publicly available at https://github.com/LHRLAB/Graph-R1.

2507.03373 2026-06-04 cs.CL 版本更新

WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

WETBench：用于检测维基百科上特定任务机器生成文本的基准

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

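Affiliations: King’s College London; Wikimedia Foundation

AI summary: Proposes WETBench, a multilingual, multi-generator, task-specific benchmark for detecting machine-generated text in Wikipedia editing scenarios. In the experiments, training-based detectors average 78% accuracy and zero-shot detectors average 58%.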

详情

AI中文摘要

鉴于维基百科作为高质量、可靠内容的可信来源，对其平台上由大型语言模型（LLM）产生的低质量机器生成文本（MGT）的扩散担忧日益增加。因此，可靠的MGT检测至关重要。然而，现有工作主要在通用生成任务上评估MGT检测器，而非维基百科编辑者更常执行的任务。这种错位可能导致在真实维基百科场景中应用时泛化能力差。我们引入了WETBench，一个多语言、多生成器、任务特定的MGT检测基准。我们定义了三个编辑任务，这些任务基于维基百科编辑者对LLM辅助编辑的感知用例进行实证：段落写作、摘要和文本风格迁移，我们使用两个新数据集在三种语言中实现这些任务。对于每个写作任务，我们评估三个提示，使用表现最佳的提示跨多个生成器生成MGT，并对多种检测器进行基准测试。我们发现，在各种设置下，基于训练的检测器平均准确率达到78%，而零样本检测器平均为58%。这些结果表明，检测器在现实生成场景中难以应对MGT，并强调了在多样化、任务特定数据上评估此类模型以评估其在编辑驱动场景中可靠性的重要性。

英文摘要

Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

2506.05233 2026-06-04 cs.LG cs.AI cs.CL 版本更新

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

MesaNet: 通过局部最优测试时训练进行序列建模

Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Sarthak Mittal, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, João Sacramento

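Affiliations: Google; Paradigms of Intelligence Team; Google DeepMind; MIT CSAIL

AI summary: Proposes a Mesa layer that performs locally optimal test-time training with a conjugate gradient solver. While keeping constant inference cost, it surpasses existing RNN models in language modeling perplexity and downstream benchmark performance.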

Comments Published at ICLR 2026

详情

AI中文摘要

序列建模目前主要由使用softmax自注意力的因果Transformer架构主导。尽管被广泛采用，Transformer在推理时需要线性扩展内存和计算。最近一系列工作将softmax操作线性化，产生了具有恒定内存和计算成本的强大循环神经网络模型，如DeltaNet、Mamba或xLSTM。这些模型可以通过注意到其循环层动态都源于上下文回归目标（通过在线学习规则近似优化）来统一。在此，我们加入这一系列工作，引入最近提出的Mesa层（von Oswald等人，2024）的一个数值稳定、可分块并行化的版本，该层原本只能顺序运行，因此不可扩展。该层同样源于上下文损失，但现在使用快速共轭梯度求解器在每个时间点将其最小化至最优。通过一系列扩展到十亿参数规模的实验，我们表明最优测试时训练使得语言建模困惑度更低，下游基准性能优于之前的RNN，尤其是在需要长上下文理解的任务上。这一性能提升以推理时额外浮点运算为代价。因此，我们的结果与最近增加测试时计算以提高性能的趋势有趣地相关——这里通过花费计算在神经网络内部解决序列优化问题来实现。

英文摘要

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require memory and compute that scale linearly with sequence length during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba, or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), which could only run sequentially in time and was therefore not scalable. This layer again stems from an in-context loss, which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments up to the billion-parameter scale, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance, here by spending compute to solve sequential optimization problems within the neural network itself.
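
To make the idea of minimizing an in-context regression loss with a conjugate gradient solver concrete, here is a minimal pure-Python sketch. It is not the paper's layer: it is dense, sequential, and uses scalar targets, and the names `conjugate_gradient`, `fit_in_context`, and the regularizer `lam` are assumptions chosen for illustration. It solves the ridge-regression normal equations `(sum_i k_i k_i^T + lam * I) w = sum_i v_i k_i` for the keys and values seen so far.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old < tol:   # squared residual norm small enough
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def fit_in_context(keys, values, lam=1e-6):
    """Ridge regression on the (key, value) pairs seen so far (hypothetical helper)."""
    d = len(keys[0])
    # H = sum_i k_i k_i^T + lam * I
    H = [[sum(k[i] * k[j] for k in keys) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    # g = sum_i v_i k_i
    g = [sum(v * k[i] for k, v in zip(keys, values)) for i in range(d)]
    return conjugate_gradient(H, g)


# Toy context generated by the linear map v = 2*k[0] + 3*k[1]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [2.0, 3.0, 5.0]
w = fit_in_context(keys, values)
prediction = dot(w, [2.0, 1.0])   # query the fitted weights
```

On this toy context the recovered weights are close to `[2, 3]`, so the query `[2, 1]` is mapped to roughly 7. Solving to optimality at each step costs extra solver iterations, which is the additional inference-time compute the abstract refers to.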

2505.19293 2026-06-04 cs.CL cs.AI cs.LG 版本更新

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

100-LongBench：事实上的长上下文基准是否真的在评估长上下文能力？

Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

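Affiliations: Case Western Reserve University; Texas A&M University; Rice University; University of California, Los Angeles; Meta

AI summary: Existing long-context benchmarks cannot separate baseline ability from true long-context ability and use fixed input lengths. To address this, the paper proposes a length-controllable long-context benchmark and a new metric for effectively evaluating the long-context ability of large language models.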

详情

AI中文摘要

长上下文能力被认为是LLM最重要的能力之一，因为真正具备长上下文能力的LLM使用户能够轻松处理许多原本繁琐的任务——例如，阅读长文档寻找答案与直接询问LLM。然而，现有的基于真实任务的长上下文评估基准有两个主要缺陷。首先，像LongBench这样的基准通常没有提供适当的指标来将长上下文性能与模型的基线能力分开，使得跨模型比较不清晰。其次，此类基准通常以固定输入长度构建，这限制了它们在不同模型上的适用性，并且无法揭示模型何时开始崩溃。为了解决这些问题，我们引入了一个长度可控的长上下文基准和一个新颖的指标，该指标将基线知识与真实的长上下文能力解耦。实验证明了我们的方法在有效评估LLM方面的优越性。

英文摘要

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

2505.17315 2026-06-04 cs.AI cs.CL cs.LG 版本更新

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

更长上下文，更深思考：揭示长上下文能力在推理中的作用

Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

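Affiliations: Case Western Reserve University; University of Minnesota - Twin Cities; Texas A&M University

AI summary: Experiments show that strengthening a model's long-context ability before supervised fine-tuning significantly improves reasoning performance, with gains that carry over even to short-input tasks, indicating that long-context modeling is a key foundation for reasoning.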

详情

AI中文摘要

近期语言模型展现出强大的推理能力，但长上下文能力对推理的影响仍未充分探索。在本工作中，我们假设当前推理能力的局限性部分源于长上下文能力不足，这一假设基于经验观察：（1）更高的上下文窗口长度通常带来更强的推理性能，（2）失败的推理案例与失败的长上下文案例相似。为验证这一假设，我们检验了在监督微调（SFT）前增强模型的长上下文能力是否能提升推理性能。具体而言，我们比较了架构和微调数据相同但长上下文能力不同的模型。结果揭示了一致趋势：长上下文能力更强的模型在SFT后，在推理基准上取得了显著更高的准确率。值得注意的是，即使在输入长度较短的任务上，这些增益也持续存在，表明长上下文训练为推理性能提供了可泛化的益处。这些发现表明，长上下文建模不仅对处理长输入至关重要，而且也是推理的关键基础。我们主张将长上下文能力作为未来语言模型设计的首要目标。

英文摘要

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compared models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

2502.17956 2026-06-04 cs.CL 版本更新

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

在跨语言和多语言环境中更好地理解程序思维推理

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

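Affiliations: School of Information Science and Technology, VISTEC; KAIST; Cohere; SCB 10X; AI Singapore; Department of Computer Engineering, Chulalongkorn University

AI summary: Proposes a framework that evaluates Program-of-Thought prompting by separating reasoning from code execution. It finds that fine-tuning significantly improves multilingual reasoning and that reasoning quality correlates strongly with answer accuracy.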

详情

DOI: 10.18653/v1/2025.findings-acl.817
Journal ref: Findings of the Association for Computational Linguistics: ACL 2025

AI中文摘要

多步推理对于大型语言模型至关重要，但多语言性能仍然具有挑战性。虽然思维链提示改进了推理，但由于推理与执行的纠缠，它在非英语语言中表现不佳。程序思维提示将推理与执行分离，提供了一种有前景的替代方案，但将挑战转移到从非英语问题生成程序上。我们提出了一个框架，通过分离多语言推理与代码执行来评估程序思维，以检验（i）微调对问题-推理对齐的影响，以及（ii）推理质量如何影响答案正确性。我们的发现表明，程序思维微调显著增强了多语言推理，优于思维链微调模型。我们进一步证明了推理质量（通过代码质量衡量）与答案准确性之间的强相关性，突出了其作为测试时性能改进启发式方法的潜力。

英文摘要

Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.

2504.12329 2026-06-04 cs.CL cs.AI 版本更新

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

推测性思考：在推理时利用大模型指导增强小模型推理能力

Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han

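Affiliations: Case Western Reserve University; Carnegie Mellon University

AI summary: Proposes Speculative Thinking, a training-free framework in which a large reasoning model guides a small model at the reasoning level, improving the small model's reasoning accuracy while shortening its output.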

详情

AI中文摘要

近期进展利用后训练来增强模型推理性能，这通常需要昂贵的训练流程，并且仍然存在低效、输出过长的问题。我们提出推测性思考，一种无需训练的框架，使大推理模型在推理层面引导小模型进行推理，区别于在词元层面操作的推测解码。我们的方法基于两个观察：（1）支持推理的词元（如“wait”）经常出现在结构分隔符（如“\n\n”）之后，作为反思或继续的信号；（2）大模型对反思行为有更强的控制，减少不必要的回溯同时提高推理质量。通过策略性地将反思步骤委托给能力更强的模型，我们的方法显著提升了推理模型的推理准确率，同时缩短了输出。在32B推理模型的辅助下，1.5B模型在MATH500上的准确率从83.2%提升至89.4%，实现了6.2%的大幅提升。同时，平均输出长度从5439个词元减少到4583个词元，下降了15.7%。此外，当应用于非推理模型（Qwen-2.5-7B-Instruct）时，我们的框架在相同基准上将准确率从74.0%提升至81.8%，实现了7.8%的相对提升。

英文摘要

Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, marking a substantial improvement of 6.2%. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, achieving a relative improvement of 7.8%.
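
The figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the helper names `point_gain` and `relative_change` are illustrative assumptions. It suggests that the quoted 6.2% and 7.8% match absolute percentage-point differences, while the 15.7% length reduction is a relative change.

```python
def point_gain(before, after):
    """Absolute change, in percentage points."""
    return after - before


def relative_change(before, after):
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the abstract
small_model = (83.2, 89.4)     # 1.5B model accuracy on MATH500, in %
output_tokens = (5439, 4583)   # average output length, in tokens
non_reasoning = (74.0, 81.8)   # Qwen-2.5-7B-Instruct accuracy, in %

print(round(point_gain(*small_model), 1))          # 6.2
print(round(relative_change(*output_tokens), 1))   # -15.7
print(round(point_gain(*non_reasoning), 1))        # 7.8
print(round(relative_change(*non_reasoning), 1))   # 10.5
```

Read this way, the gain for the non-reasoning model is 7.8 percentage points, which corresponds to a relative improvement of about 10.5% over the 74.0% starting accuracy.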

2307.00862 2026-06-04 cs.CV cs.CL 版本更新

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

UniFine: 一种统一且细粒度的零样本视觉-语言理解方法

Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

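Affiliations: Columbia University; Microsoft Research; University of California, Los Angeles

AI summary: Proposes the UniFine framework, which uses fine-grained information such as sentence keywords and image objects for image-text matching, handles vision-language tasks such as VQA, SNLI-VE, and VCR in a unified zero-shot setting, and achieves significant improvements on multiple datasets.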

Comments 14 pages, 4 figures, ACL 2023 Findings

详情

AI中文摘要

视觉-语言任务，如VQA、SNLI-VE和VCR，具有挑战性，因为它们需要模型的推理能力来理解视觉世界和自然语言的语义。针对视觉-语言任务的监督方法已被充分研究。然而，在零样本设置下解决这些任务的研究较少。由于对比语言-图像预训练（CLIP）在图像-文本匹配上展现了显著的零样本性能，先前的工作通过将视觉-语言任务转换为图像-文本匹配问题来利用其强大的零样本能力，并且它们主要考虑全局级别的匹配（例如，整个图像或句子）。然而，我们发现视觉和文本的细粒度信息，例如句子中的关键词和图像中的对象，对于语义理解可能相当有信息量。受此启发，我们提出了一个统一框架，利用细粒度信息进行零样本视觉-语言学习，涵盖多个任务，如VQA、SNLI-VE和VCR。我们的实验表明，我们的框架在VQA上优于先前的零样本方法，并在SNLI-VE和VCR上取得了显著改进。此外，我们的消融研究证实了我们提出的方法的有效性和泛化性。

英文摘要

Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for vision-language tasks have been well-studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantics understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.

2407.03884 2026-06-04 cs.CL cs.AI 版本更新

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

ChatSOP: 一种SOP引导的MCTS规划框架，用于可控的LLM对话代理

Zhigen Li, Jianxiang Peng, Yanmeng Wang, Yong Cao, Tianhao Shen, Minghui Zhang, Linxi Su, Shang Wu, Yihang Wu, Yuqian Wang, Ye Wang, Wei Hu, Jianfeng Li, Shaojun Wang, Jing Xiao, Deyi Xiong

Affiliations: TJUNLP Lab, College of Intelligence and Computing, Tianjin University; Ping An Technology; Tübingen AI Center, University of Tübingen; Kunming University of Science and Technology

AI summary: Proposes the ChatSOP framework, which improves the controllability of LLM dialogue agents through SOP-guided Monte Carlo Tree Search, raising action accuracy by 27.95% over GPT-3.5-based baselines.

Comments Accepted to ACL 2025 main

详情

DOI: 10.18653/v1/2025.acl-long.863
Journal ref: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17637-17659, 2025

AI中文摘要

由大型语言模型驱动的对话代理在各种任务中表现出优越的性能。尽管它们能更好地理解用户并生成类人回复，但缺乏可控性仍然是一个关键挑战，常常导致对话偏离主题或任务失败。为了解决这个问题，我们引入标准操作程序来规范对话流程。具体来说，我们提出了ChatSOP，一种新颖的SOP引导的蒙特卡洛树搜索规划框架，旨在增强LLM驱动的对话代理的可控性。为此，我们整理了一个数据集，包含使用GPT-4o的半自动角色扮演系统生成的、经过严格人工质量控制验证的SOP标注的多场景对话。此外，我们提出了一种新方法，将思维链推理与监督微调相结合用于SOP预测，并利用SOP引导的蒙特卡洛树搜索在对话中进行最优动作规划。实验结果表明了我们方法的有效性，例如，与基于GPT-3.5的基线模型相比，动作准确率提高了27.95%，并且在开源模型上也显示出显著的提升。数据集和代码已公开。

英文摘要

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite their better user understanding and human-like responses, their lack of controllability remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedures (SOPs) to regulate dialogue flow. Specifically, we propose ChatSOP, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method: it achieves a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also shows notable gains for open-source models. The dataset and code are publicly available.

2408.11121 2026-06-04 cs.LG cs.AI cs.CL cs.CR 版本更新

DOMBA: Double Model Balancing for Access-Controlled Language Models via Minimum-Bounded Aggregation

DOMBA: 通过最小有界聚合实现访问控制语言模型的双模型平衡

Tom Segal, Asaf Shabtai, Yuval Elovici

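Affiliations: Ben-Gurion University

AI summary: Proposes DOMBA, which uses a min-bounded average function to aggregate the probability distributions of two language models trained on documents with different access levels, achieving high utility while guaranteeing security.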

Comments Code: https://github.com/ppo1/DOMBA 11 pages, 3 figures

详情

DOI: 10.1609/aaai.v39i23.34695
Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25101-25109, 2025

AI中文摘要

大型语言模型（LLMs）的实用性在很大程度上取决于其训练数据的质量和数量。许多组织拥有大量数据语料库，可用于训练或微调针对其特定需求的LLMs。然而，这些数据集通常带有基于用户权限并由访问控制机制强制执行的访问限制。在此类数据集上训练LLMs可能导致敏感信息暴露给未经授权的用户。防止此类暴露的一种直接方法是为每个访问级别训练一个单独的模型。然而，由于每个模型的训练数据量相对于整个组织语料库的总量有限，这可能导致模型效用低下。另一种方法是在所有数据上训练单个LLM，同时限制未经授权信息的暴露。然而，当前针对LLMs的暴露限制方法对于访问控制数据无效，因为敏感信息在多个训练样本中频繁出现。我们提出DOMBA——双模型平衡——一种训练和部署LLMs的简单方法，可在提供高效用和访问控制功能的同时保证安全性。DOMBA使用“最小有界”平均函数（一个受较小值约束的函数，例如调和平均）聚合两个模型的概率分布，每个模型在具有（可能多个）不同访问级别的文档上训练。详细的数学分析和广泛评估表明，DOMBA在保护受限信息的同时，提供了与非安全模型相当的效用。

英文摘要

The utility of large language models (LLMs) depends heavily on the quality and quantity of their training data. Many organizations possess large data corpora that could be leveraged to train or fine-tune LLMs tailored to their specific needs. However, these datasets often come with access restrictions that are based on user privileges and enforced by access control mechanisms. Training LLMs on such datasets could result in exposure of sensitive information to unauthorized users. A straightforward approach for preventing such exposure is to train a separate model for each access level. This, however, may result in low utility models due to the limited amount of training data per model compared to the amount in the entire organizational corpus. Another approach is to train a single LLM on all the data while limiting the exposure of unauthorized information. However, current exposure-limiting methods for LLMs are ineffective for access-controlled data, where sensitive information appears frequently across many training examples. We propose DOMBA - double model balancing - a simple approach for training and deploying LLMs that provides high utility and access-control functionality with security guarantees. DOMBA aggregates the probability distributions of two models, each trained on documents with (potentially many) different access levels, using a "min-bounded" average function (a function that is bounded by the smaller value, e.g., harmonic mean). A detailed mathematical analysis and extensive evaluation show that DOMBA safeguards restricted information while offering utility comparable to non-secure models.
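
The aggregation step described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the harmonic mean, which the abstract names as one example of a "min-bounded" function, and it renormalizes the result, which is an assumption of this sketch rather than something the abstract specifies. The names `harmonic_mean` and `min_bounded_aggregate` and the toy vocabulary are hypothetical.

```python
def harmonic_mean(p, q):
    """Harmonic mean of two probabilities; at most twice the smaller one."""
    if p == 0.0 or q == 0.0:
        return 0.0
    return 2.0 * p * q / (p + q)


def min_bounded_aggregate(dist_a, dist_b):
    """Combine two next-token distributions, then renormalize (assumed step)."""
    tokens = set(dist_a) | set(dist_b)
    scores = {t: harmonic_mean(dist_a.get(t, 0.0), dist_b.get(t, 0.0))
              for t in tokens}
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("the two distributions share no token with nonzero mass")
    return {t: s / total for t, s in scores.items()}


# Toy example: only model A has seen a restricted name in its training data.
model_a = {"alice": 0.9, "the": 0.1}
model_b = {"alice": 0.001, "the": 0.999}
combined = min_bounded_aggregate(model_a, model_b)
```

Because the harmonic mean never exceeds twice the smaller input, a token that only one model rates highly (here `"alice"`) ends up with a small combined probability, while tokens both models support keep most of the mass.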

URL PDF HTML ☆

赞 0 踩 0

2412.06095 2026-06-04 cs.CL cs.FL cs.IT math.IT 版本更新

Measuring Grammatical Diversity from Small Corpora: Derivational Entropy Rates, Mean Length of Utterances, and Annotation Invariance

从小语料库测量语法多样性：派生熵率、平均话语长度和注释不变性

Fermin Moscoso del Prado Martin

发表机构 * Department of Computer Science and Technology & Jesus College University of Cambridge（计算机科学与技术系及耶稣学院，剑桥大学）

AI总结本文从理论和实证上证明语法的派生熵与其生成的话语平均长度（MLU）之间存在根本联系，提出派生熵率作为衡量语法复杂性的新指标，并引入平滑诱导树库熵（SITE）从小树库中准确估计这些度量。

详情

DOI: 10.1162/COLI.a.15
Journal ref: Computational Linguistics (2025) 51 (4): 1191-1233

AI中文摘要

在许多领域，如语言习得、语言神经心理学、衰老研究和历史语言学，语料库被用于估计个体、社区或说话者类型在一段时间内产生的语法结构的多样性。在这些情况下，树库被视为可能遇到的句法结构的代表性样本。从小型语料库中记录的结构推广潜在的句法多样性需要谨慎的外推，其准确性受到代表性子语料库规模有限的制约。在本文中，我从理论和实证上证明，语法的派生熵与其生成的话语平均长度（MLU）之间存在根本联系，从而产生了一个新的度量——派生熵率。话语平均长度成为句法复杂性最实用的指标；我证明MLU不仅仅是一个代理，而是语法多样性的基本度量。结合新的派生熵率度量，它提供了一种无理论的语法复杂性评估。派生熵率索引了不同语法注释框架确定树库语法复杂性的速率。我引入了平滑诱导树库熵（SITE）作为准确估计这些度量的工具，即使从非常小的树库中也能做到。最后，我讨论了这些结果对自然语言处理和人类语言处理的重要启示。

英文摘要

In many fields, such as language acquisition, neuropsychology of language, the study of aging, and historical linguistics, corpora are used for estimating the diversity of grammatical structures that are produced during a period by an individual, community, or type of speakers. In these cases, treebanks are taken as representative samples of the syntactic structures that might be encountered. Generalizing the potential syntactic diversity from the structures documented in a small corpus requires careful extrapolation whose accuracy is constrained by the limited size of representative sub-corpora. In this article, I demonstrate -- theoretically, and empirically -- that a grammar's derivational entropy and the mean length of the utterances (MLU) it generates are fundamentally linked, giving rise to a new measure, the derivational entropy rate. The mean length of utterances becomes the most practical index of syntactic complexity; I demonstrate that MLU is not a mere proxy, but a fundamental measure of syntactic diversity. In combination with the new derivational entropy rate measure, it provides a theory-free assessment of grammatical complexity. The derivational entropy rate indexes the rate at which different grammatical annotation frameworks determine the grammatical complexity of treebanks. I introduce the Smoothed Induced Treebank Entropy (SITE) as a tool for estimating these measures accurately, even from very small treebanks. I conclude by discussing important implications of these results for both NLP and human language processing.

2409.11901 2026-06-04 cs.CL 版本更新

LLMs + Persona-Plug = Personalized LLMs

LLMs + Persona-Plug = 个性化大语言模型

Jiongnan Liu, Yutao Zhu, Shuting Wang, Xiaochi Wei, Erxue Min, Yu Lu, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Engineering Research Center of Next-Generation Intelligent Search and Recommendation, MOE; Baidu Inc.

AI summary: Proposes Persona-Plug (PPlug), a lightweight plug-in user embedder module that models a user's historical contexts to produce a personalized embedding, making LLM outputs more personalized without fine-tuning the LLM.

详情

AI中文摘要

个性化在众多语言任务和应用中扮演着关键角色，因为具有相同需求的用户可能根据个人兴趣偏好不同的输出。这促使了各种个性化方法的发展，旨在使大语言模型（LLMs）适应生成符合用户偏好的定制化输出。其中一些方法涉及为每个用户微调一个独特的个性化LLM，这过于昂贵而难以广泛应用。替代方法以即插即用的方式引入个性化信息，通过检索用户相关历史文本作为示例。然而，这种基于检索的策略可能会破坏用户历史的连续性，无法捕捉用户的整体风格和模式，从而导致次优性能。为了解决这些挑战，我们提出了一种新颖的个性化LLM模型PPlug。它通过一个轻量级的插件式用户嵌入模块，对每个用户的所有历史上下文进行建模，构建用户特定的嵌入。通过将该嵌入附加到任务输入中，LLMs可以更好地理解和捕捉用户的习惯与偏好，从而在不调整自身参数的情况下生成更个性化的输出。在语言模型个性化（LaMP）基准上的各种任务上的大量实验表明，所提出的模型显著优于现有的个性化LLM方法。

英文摘要

Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, PPlug. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.

2407.03956 2026-06-04 cs.MA cs.CL 版本更新

Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

使用约束引导的多智能体系统解决斑马谜题

Shmuel Berman, Kathleen McKeown, Baishakhi Ray

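Affiliations: Princeton University; Columbia University

AI summary: Proposes ZPS, a multi-agent system that combines large language models with a theorem prover. By decomposing the problem, generating SMT code, and exchanging feedback between agents, it significantly improves the solving of complex logic puzzles.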

详情

AI中文摘要

先前的研究通过链式思维提示或引入符号表示等技术，增强了大语言模型（LLMs）解决逻辑谜题的能力。然而，由于将自然语言线索翻译为逻辑语句的固有复杂性，这些框架通常仍不足以解决复杂的逻辑问题，例如斑马谜题。我们引入了一个多智能体系统ZPS，它将LLMs与现成的定理证明器集成在一起。该系统通过将问题分解为更小、更易管理的部分，生成SMT（可满足性模理论）代码以使用定理证明器求解，并利用智能体之间的反馈来反复改进答案，从而处理复杂的谜题求解任务。我们还引入了一个自动网格谜题评分器来评估我们谜题解决方案的正确性，并通过用户研究评估了该自动评分器的可靠性。我们的方法在我们测试的所有三个LLM中均显示出改进，其中GPT-4的完全正确解决方案数量提高了166%。

英文摘要

Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off-the-shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing a 166% improvement in the number of fully correct solutions.
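
To illustrate what translating natural language clues into logical statements looks like, here is a small sketch. It does not use an SMT solver or reproduce ZPS: it is a brute-force enumeration over permutations that stands in for the constraint-solving step, and the three-house puzzle, its clues, and the function names are all invented for this example.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")


def satisfies_clues(colors, pets):
    """Each clue becomes one boolean constraint over house positions 0..2."""
    red = colors.index("red")
    green = colors.index("green")
    blue = colors.index("blue")
    return (
        green == red + 1           # the red house is immediately left of the green house
        and pets[blue] == "dog"    # the dog lives in the blue house
        and pets[red] == "fish"    # the fish lives in the red house
        and pets[0] != "dog"       # the dog does not live in the first house
    )


def solve():
    """Enumerate every assignment and keep those that satisfy all clues."""
    return [
        (colors, pets)
        for colors in permutations(COLORS)
        for pets in permutations(PETS)
        if satisfies_clues(colors, pets)
    ]


solutions = solve()
print(solutions)
```

For this toy puzzle exactly one assignment survives: the houses are red, green, blue from left to right, holding the fish, the cat, and the dog. Enumeration grows factorially with puzzle size, which is one reason a system like the one described hands such constraints to a dedicated solver.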

2402.02555 2026-06-04 cs.CV cs.CL 版本更新

High-Quality Entity Segmentation and Grounding

高质量实体分割与定位

Lu Qi, Yi-Wen Chen, Tao Zhang, Xiangtai Li, Xu Yang, Bo Du, Ming-Hsuan Yang

Affiliations: Wuhan University; Insta360 Research; Department of EECS, University of California, Merced; Nanyang Technological University; Institute of Automation of the Chinese Academy of Sciences

AI summary: Proposes the ESG pipeline, which achieves high-quality entity segmentation and grounding through the new EntitySeg dataset and a two-stage decoupled design (CropFormer for high-quality segmentation, GELLA for accurate noun extraction and semantic matching), and is effective on five tasks.

详情

AI中文摘要

在这项工作中，我们提出了ESG，一个由新数据集EntitySeg支持的高质量实体分割与定位流水线。首先，所提出的数据集命名为EntitySeg，包含跨越各种图像域和实体的图像，以及用于训练和测试的大量高分辨率图像和高质量掩码标注。然后，ESG主要由两个模块组成：用于高质量实体分割的CropFormer，以及用于从句子中精确提取名词并在语言和视觉区域之间进行语义匹配的GELLA。与现有联合训练分割和大语言模型的定位方法不同，ESG采用两阶段解耦设计，保留了高质量掩码和定位鲁棒性，避免了联合训练通常带来的权衡。CropFormer确保高质量实体分割结果，然后可以编码到GELLA模型中进行有效定位。大量实验结果表明，我们提出的流水线在五项任务上有效，包括实体分割、全景分割、开放词汇分割、指代分割和全景定位叙述。此外，ESG流水线的GELLA模块高度灵活，能够处理来自任何分割框架的掩码输入，这得益于其轻量级的颜色图/视觉编码器、语言/掩码解码器和关联模块。实体分割数据集和定位代码将在https://github.com/qqlu/Entity发布。

英文摘要

In this work, we propose ESG, a pipeline for high-quality entity segmentation and grounding supported by a new dataset, EntitySeg. First, the proposed EntitySeg dataset contains images spanning various image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Second, ESG mainly consists of two modules: CropFormer for high-quality entity segmentation, and GELLA for accurate noun extraction from sentences and semantic matching between language and visual regions. Unlike existing grounding methods that jointly train a segmentation model and a large language model, ESG adopts a two-stage decoupled design, preserving high-quality masks and grounding robustness without the trade-offs often introduced by joint training. CropFormer ensures high-quality entity segmentation results, which can then be encoded into the GELLA model for effective grounding. Extensive experimental results demonstrate the effectiveness of our proposed pipeline across five tasks, including entity segmentation, panoptic segmentation, open-vocabulary segmentation, referring segmentation, and panoptic localized narratives. Furthermore, the GELLA module of the ESG pipeline is highly flexible and capable of processing mask inputs from any segmentation framework, thanks to its lightweight colormap/vision encoder, language/mask decoder, and association module. The entity segmentation dataset and grounding code will be released at https://github.com/qqlu/Entity.

1904.03152 2026-06-04 eess.SY cs.CL cs.NE cs.SY Version update

Data-driven Modelling of Dynamical Systems Using Tree Adjoining Grammar and Genetic Programming

Dhruv Khandelwal, Maarten Schoukens, Roland Tóth

Affiliations: Department of Electrical Engineering; Eindhoven University of Technology

AI summary: This paper applies a method for data-driven modelling of non-linear dynamical systems that uses Tree Adjoining Grammar and Genetic Programming, partially automating the modelling process and analysing its performance under different challenges.

Comments Paper accepted at IEEE CEC 2019

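Details

DOI: 10.1109/CEC.2019.8790250

AI-generated abstract (translated from Chinese)

State-of-the-art methods for data-driven modelling of non-linear dynamical systems usually require interaction with an expert user. To partially automate the process of modelling physical systems from data, many approaches based on evolutionary algorithms have been proposed for model-structure selection, particularly for non-linear systems. Recently, an approach that uses Genetic Programming (GP) for data-driven modelling of non-linear dynamical systems was proposed. The novelty of that method lies in its modelling of noise and its use of Tree Adjoining Grammar to shape the search space explored by GP. In this paper, we report the results of the method on three case studies, each based on a real physical system. The case studies pose a variety of challenges, covering differing degrees of prior knowledge of the true system, the amount of available data, the complexity of the system dynamics, and the non-linear characteristics of the system. Based on the results obtained in the case studies, we critically analyse the performance of the proposed method.

English abstract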

State-of-the-art methods for data-driven modelling of non-linear dynamical systems typically involve interactions with an expert user. In order to partially automate the process of modelling physical systems from data, many approaches based on evolutionary algorithms (EAs) have been proposed for model-structure selection, with special focus on non-linear systems. Recently, an approach for data-driven modelling of non-linear dynamical systems using Genetic Programming (GP) was proposed. The novelty of the method was the modelling of noise and the use of Tree Adjoining Grammar to shape the search space explored by GP. In this paper, we report results achieved by the proposed method on three case studies. Each of the case studies considered here is based on a real physical system. The case studies pose a variety of challenges. In particular, these challenges range over varying amounts of prior knowledge of the true system, the amount of data available, the complexity of the dynamics of the system, and the nature of the non-linearities in the system. Based on the results achieved for the case studies, we critically analyse the performance of the proposed method.

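1701.08711 2026-06-04 cs.CL cs.LG econ.GN q-fin.EC stat.ML Version update

Predicting Auction Price of Vehicle License Plate with Deep Recurrent Neural Network

Vinci Chow

Affiliations: Department of Economics, The Chinese University of Hong Kong, Shatin, Hong Kong

AI summary: This paper proposes treating vehicle license plate price prediction as a natural language processing task, builds a deep recurrent neural network to predict auction prices of Hong Kong license plates, and shows that the model explains price variation and can be extended into a plate search engine.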

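Details

DOI: 10.1016/j.eswa.2019.113008

AI-generated abstract (translated from Chinese)

In Chinese societies, superstition is extremely important, and vehicle license plates with auspicious numbers can sell for high prices at auction. Unlike other valuable items, license plates are not given an estimated price before auction. This paper proposes treating plate price prediction as a natural language processing (NLP) task, because the value depends on the meaning and semantics of each character on the plate. It builds a deep recurrent neural network (RNN) to predict the prices of Hong Kong license plates from the characters on the plate. Evaluated on 13 years of historical auction prices, the deep RNN's predictions explain more than 80% of the variation in prices, significantly outperforming previous models. The paper also shows how the model can be extended into a search engine for plates and can provide estimates of the price distribution.

English abstract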

In Chinese societies, superstition is of paramount importance, and vehicle license plates with desirable numbers can fetch very high prices in auctions. Unlike other valuable items, license plates are not allocated an estimated price before auction. I propose that the task of predicting plate prices can be viewed as a natural language processing (NLP) task, as the value depends on the meaning of each individual character on the plate and its semantics. I construct a deep recurrent neural network (RNN) to predict the prices of vehicle license plates in Hong Kong, based on the characters on a plate. I demonstrate the importance of having a deep network and of retraining. Evaluated on 13 years of historical auction prices, the deep RNN's predictions can explain over 80 percent of price variations, outperforming previous models by a significant margin. I also demonstrate how the model can be extended to become a search engine for plates and to provide estimates of the expected price distribution.
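
To make the "plate as text" idea concrete, the following is a minimal sketch of reading a plate one character at a time with a recurrent layer and producing a single scalar. It is not the author's model: the abstract describes a deep RNN, while this uses one untrained layer, and the alphabet, padding length, layer sizes, and the names `encode_plate`, `init_params`, and `predict_score` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical character set: letters, digits, and a space used for padding.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_plate(plate, max_len=8):
    """Map a plate string to a fixed-length array of character indices."""
    padded = plate.upper().ljust(max_len)[:max_len]
    return np.array([CHAR_TO_INDEX[ch] for ch in padded])


def init_params(rng, embed_dim=8, hidden_dim=16):
    """Random, untrained weights; a real model would learn these from auction data."""
    return {
        "embed": rng.normal(scale=0.1, size=(len(ALPHABET), embed_dim)),
        "w_in": rng.normal(scale=0.1, size=(embed_dim, hidden_dim)),
        "w_rec": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        "bias": np.zeros(hidden_dim),
        "w_out": rng.normal(scale=0.1, size=hidden_dim),
        "b_out": 0.0,
    }


def predict_score(indices, params):
    """Run one recurrent layer over the characters and read out a scalar."""
    hidden = np.zeros(params["w_rec"].shape[0])
    for idx in indices:
        hidden = np.tanh(
            params["embed"][idx] @ params["w_in"]
            + hidden @ params["w_rec"]
            + params["bias"]
        )
    return float(hidden @ params["w_out"] + params["b_out"])


rng = np.random.default_rng(0)
params = init_params(rng)
indices = encode_plate("HK 88")
score = predict_score(indices, params)
```

Because the weights are random, `score` carries no meaning here; the point is only the data flow from characters to embeddings to a recurrent state to a scalar that a trained model could fit to auction prices.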

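1902.01119 2026-06-04 cs.AI cs.CL cs.LG cs.SY eess.SY Version update

The Natural Language of Actions

Guy Tennenholtz, Shie Mannor

Affiliations: Faculty of Electrical Engineering, Technion Institute of Technology, Israel

AI summary: This paper proposes Act2Vec, a framework for learning context-based action representations to improve reinforcement learning performance; it groups similar actions and exploits the relationships between actions to improve Q-value approximation and state representation.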

Comments Published in the proceedings of the 36th International Conference on Machine Learning (ICML 2019)

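1811.00641 2026-06-04 cs.LG cs.CL cs.NA math.NA stat.ML Version update

Online Embedding Compression for Text Classification using Low Rank Matrix Factorization

Anish Acharya, Rahul Goel, Angeliki Metallinou, Inderjit Dhillon

Affiliations: Amazon Alexa AI; Amazon Search Technologies; University of Texas at Austin

AI summary: This paper proposes an online embedding compression method that uses low-rank matrix factorization to compress the word embedding layer during training, easing the memory bottleneck of NLP models, with accuracy recovered by re-training on the downstream task. Experiments show 90% compression on sentence classification tasks and better results than alternatives such as fixed-point quantization.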

Comments Accepted in Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019)

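Details

AI-generated abstract (translated from Chinese)

Deep learning models have become the state of the art for natural language processing (NLP) tasks, but deploying them in production systems runs into significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that uses low-rank matrix factorization during training to compress the word embedding layer, which is the main memory bottleneck of most NLP models. Our models are trained, compressed, and then re-trained on the downstream task to recover accuracy while keeping the reduced size. Experiments show that the proposed method achieves 90% compression on sentence classification tasks with minimal impact on accuracy, and that it outperforms fixed-point quantization and other methods such as offline word embedding compression. We also analyse the inference time and storage space of our method through FLOP calculations, showing that we can compress DNN models by a configurable ratio and recover the lost accuracy without introducing additional latency. Finally, we introduce a new learning rate schedule, the Cyclically Annealed Learning Rate (CALR), and show through experiments on a sentence classification benchmark that it outperforms other popular adaptive learning rate algorithms.

English abstract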

Deep learning models have become state of the art for natural language processing (NLP) tasks; however, deploying these models in production systems poses significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that leverages low-rank matrix factorization during training to compress the word embedding layer, which represents the size bottleneck for most NLP models. Our models are trained, compressed, and then further re-trained on the downstream task to recover accuracy while maintaining the reduced size. Empirically, we show that the proposed method can achieve 90% compression with minimal impact on accuracy for sentence classification tasks, and outperforms alternative methods like fixed-point quantization or offline word embedding compression. We also analyze the inference time and storage space for our method through FLOP calculations, showing that we can compress DNN models by a configurable ratio and regain lost accuracy without introducing additional latency compared to fixed-point quantization. Finally, we introduce a novel learning rate schedule, the Cyclically Annealed Learning Rate (CALR), which we empirically demonstrate to outperform other popular adaptive learning rate algorithms on a sentence classification benchmark.
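
As a rough illustration of how a low-rank factorization shrinks an embedding layer, the sketch below replaces a `vocab × dim` matrix with two factors of shapes `vocab × rank` and `rank × dim` using a truncated SVD. This is an assumption-laden stand-in rather than the paper's procedure: the abstract does not say how the factors are obtained, and the online factorization during training and the re-training step are omitted. The matrix sizes, the synthetic data, and the helper names `low_rank_factorize` and `compression_ratio` are hypothetical.

```python
import numpy as np


def low_rank_factorize(embedding, rank):
    """Factor a (vocab, dim) matrix into (vocab, rank) @ (rank, dim) via truncated SVD."""
    u, s, vt = np.linalg.svd(embedding, full_matrices=False)
    left = u[:, :rank] * s[:rank]  # fold the singular values into the left factor
    right = vt[:rank, :]
    return left, right


def compression_ratio(vocab, dim, rank):
    """Fraction of parameters saved by storing the two factors instead of the full matrix."""
    original = vocab * dim
    compressed = rank * (vocab + dim)
    return 1.0 - compressed / original


rng = np.random.default_rng(0)
vocab, dim, rank = 1000, 64, 6

# Synthetic embedding with approximately low-rank structure plus small noise.
embedding = rng.normal(size=(vocab, rank)) @ rng.normal(size=(rank, dim))
embedding += 0.01 * rng.normal(size=(vocab, dim))

left, right = low_rank_factorize(embedding, rank)
rel_error = np.linalg.norm(embedding - left @ right) / np.linalg.norm(embedding)
ratio = compression_ratio(vocab, dim, rank)
```

With these illustrative sizes the two factors hold 6 × (1000 + 64) = 6384 values instead of 64000, a saving of roughly 90%. Real embedding matrices are not exactly low rank, which is presumably why the abstract describes re-training after compression to recover accuracy.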

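1803.02238 2026-06-04 cs.RO cs.CL cs.SY eess.SY Version update

Precise but Natural Specification for Robot Tasks

Ivan Gavran, Brendon Boldt, Eva Darulova, Rupak Majumdar

Affiliations: Max Planck Institute for Software Systems, Germany

AI summary: Flipper provides a natural language interface for specifying high-level robot tasks, combining a formal core language with a semantic parser; it gives visual feedback, supports extension through natural language, and makes task description more efficient.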

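Details

AI-generated abstract (translated from Chinese)

We present Flipper, a natural language interface for describing high-level task specifications for robots and compiling them into robot actions. Flipper starts from a formal core language that can express rich temporal specifications, and provides a natural language interface through a semantic parser. Flipper gives immediate visual feedback by executing the automatically constructed plan in a graphical user interface, allowing the user to resolve potentially ambiguous interpretations. Flipper extends itself through naturalization: users can add definitions for utterances, from which Flipper induces new rules and adds them to the core language, gradually forming a more natural task specification language. Flipper improves naturalization by generalizing the definitions that users provide. Unlike other task-specification systems, Flipper enables natural language interaction while retaining the expressive power and formal precision of a programming language. An initial user study shows that natural language interaction and generalization can considerably simplify task description. Moreover, over time, users employ more concepts that lie outside the initial core language. These extensions are available to the Flipper community, and users can use concepts defined by others.

English abstract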

We present Flipper, a natural language interface for describing high-level task specifications for robots that are compiled into robot actions. Flipper starts with a formal core language for task planning that allows expressing rich temporal specifications and uses a semantic parser to provide a natural language interface. Flipper provides immediate visual feedback by executing an automatically constructed plan of the task in a graphical user interface. This allows the user to resolve potentially ambiguous interpretations. Flipper extends itself via naturalization: its users can add definitions for utterances, from which Flipper induces new rules and adds them to the core language, gradually growing a more and more natural task specification language. Flipper improves the naturalization by generalizing the definition provided by users. Unlike other task-specification systems, Flipper enables natural language interactions while maintaining the expressive power and formal precision of a programming language. We show through an initial user study that natural language interactions and generalization can considerably ease the description of tasks. Moreover, over time, users employ more and more concepts outside of the initial core language. Such extensions are available to the Flipper community, and users can use concepts that others have defined.

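1804.01189 2026-06-04 eess.SY cs.CL cs.SY math.OC stat.ML Version update

Real-Time Prediction of the Duration of Distribution System Outages

Aaron Jaech, Baosen Zhang, Mari Ostendorf, Daniel S. Kirschen

AI summary: This paper trains neural networks on historical outage records to predict outage duration, making an initial prediction from environmental factors and then improving it by analysing the text of field reports; a case study shows that natural language processing can identify outage causes and repair steps.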

Comments Appears in IEEE Transactions on Power Systems