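arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.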

2605.31524 2026-06-01 cs.LG cs.LO

Value Functions as Supermartingale Certificates

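Alessandro Abate, Daniel Contro, Mirco Giacobbe, Agustín Martínez-Suñé, Diptarko Roy

AI summary: This paper establishes a theoretical connection between value functions and Streett supermartingale certificates, combining formal verification of stochastic systems with reinforcement learning and giving an RL-based route to certificate synthesis for ω-regular properties.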

Comments To appear in SAIV'26

Abstract

Certification methods for stochastic systems provide sufficient proof rules, based on real-valued supermartingale certificates, to determine the almost-sure satisfaction of $ω$-regular properties (and therefore of linear temporal logic) over general state spaces, encompassing both countably infinite and continuous state spaces. Conversely, reinforcement learning (RL) methods for $ω$-regular tasks have received considerable attention, but they typically lack formal guarantees that the learned policy satisfies the specification, except possibly for finite state and action spaces. We bridge these two lines of research by establishing a novel theoretical connection: under an appropriate reward, the value function associated to a policy that almost surely satisfies an $ω$-regular property encodes a Streett supermartingale certificate for that specification. Our results, validated experimentally on finite Markov decision processes, hold for finite, countably infinite, and continuous state spaces, suggesting a principled route to certificate synthesis via RL.

2605.31522 2026-06-01 cs.LG q-bio.GN q-bio.QM

Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects

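Artur Szałata, Olga Novitskaia, Maiia Shulman, Matthew Mella, Altynbek Zhubanchaliyev, Fabian J. Theis

AI summary: To address the fragmentation of small-molecule perturbation transcriptomic data, the authors build Chem-PerturBridge, a harmonized resource covering 37k compounds, 136 cellular contexts, and 1.25M samples, and validate it for assessing cross-dataset signature agreement and for pretraining compound representations.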

Comments 33 pages, 6 figures, 16 tables

Abstract

Large perturbation models require training data encompassing chemical, cellular, and assay diversity. Current transcriptomic resources for small-molecule modeling, however, are fragmented across technologies, metadata conventions, controls, doses, and preprocessing pipelines. We introduce Chem-PerturBridge, a harmonized multi-dataset resource comprising over 37k compounds, 136 cellular contexts, and 1.25M transcriptomic samples across eight assay types, with standardized identifiers, metadata, and replicate-aware condition-level effects. We use the resource to evaluate matched-condition agreement across datasets and replicate agreement within datasets. Matched same-compound conditions generally show weak agreement in fine-grained logFC rankings and magnitudes across most dataset pairs, often falling below same-context different-compound baselines. In contrast, logFC direction agreement is substantially more stable and usually exceeds these baselines. We further evaluate Chem-PerturBridge as a pretraining resource for compound representation learning. Under a compound-held-out OP3 evaluation split, embeddings pretrained on Chem-PerturBridge improve over L1000-only embeddings, Morgan fingerprints, and the descriptor-free OP3 baseline across metrics. An extensive molecule-holdout evaluation across 11 datasets further shows that models trained on Chem-PerturBridge outperform or match those that are not. Chem-PerturBridge therefore supports both diagnostic evaluation of cross-dataset signature agreement and model-oriented reuse of heterogeneous perturbation transcriptomic data.

2605.31521 2026-06-01 cs.CL cs.SD

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

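Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou

AI summary: Proposes the UniAudio-Token framework, which uses Semantic-Acoustic Primitives (SAP) and a Semantic-Acoustic Equilibrium (SAE) mechanism to give semantic tokenizers general audio perception without sacrificing speech ability, yielding a unified audio interface.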

Comments 19 pages, 10 figures

Abstract

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.

2605.31518 2026-06-01 cs.LG

On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

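Elana Simon, Etowah Adams, James Zou

AI summary: Through theoretical analysis and experiments, this paper shows how dimension-level activation outliers cause feature death in sparse autoencoders, and shows that mean-centering as a preprocessing step eliminates outlier-induced death.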

Comments Accepted to ICML 2026 main conference

Abstract

Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: near-zero on GPT-2, over 70% on AlphaFold3 with identical configurations. We find that dimension-level activation outliers (dimensions whose mean magnitude is large relative to per-token variation) cause this by shifting pre-activations at initialization based on each feature's alignment with the activation mean. Features anti-aligned with the mean receive permanently negative pre-activations and never fire. We formalize outlier severity as $γ= \|μ\|/\|σ\|$; it predicts initial death rates (Spearman $ρ= 0.89$ for dead-by-TopK, $0.82$ for dead-by-ReLU) across 454 model-layer combinations spanning language, vision, protein, and genomic models. Dead features can revive during training, but recovery requires the SAE bias to learn the activation mean, a process that is prohibitively slow at high $γ$. Mean-centering (subtracting the activation mean) sidesteps this and eliminates outlier-induced death across all tested models, confirming the mechanism and providing a principled basis for when and why this preprocessing step is necessary.
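
The outlier-severity ratio and the mean-centering step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that μ and σ are the per-dimension mean and standard deviation of the activations taken over tokens, and the function names and the toy activation matrix are hypothetical.

```python
import math

def column_stats(acts):
    """Per-dimension mean and standard deviation over tokens (rows)."""
    n, d = len(acts), len(acts[0])
    mu = [sum(row[j] for row in acts) / n for j in range(d)]
    sigma = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in acts) / n) for j in range(d)]
    return mu, sigma

def l2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def outlier_severity(acts):
    """gamma = ||mu|| / ||sigma||, reading mu and sigma as per-dimension vectors (assumption)."""
    mu, sigma = column_stats(acts)
    return l2(mu) / l2(sigma)

def mean_center(acts):
    """Subtract the per-dimension activation mean from every token."""
    mu, _ = column_stats(acts)
    return [[x - m for x, m in zip(row, mu)] for row in acts]

# Toy activations: dimension 0 has a large mean and small per-token variation.
acts = [[10.0, 1.0], [10.5, -1.0], [9.5, 0.5], [10.0, -0.5]]
gamma_raw = outlier_severity(acts)
gamma_centered = outlier_severity(mean_center(acts))
```

On this toy input the ratio is large before centering and drops to zero afterwards, which is the effect the abstract attributes to mean-centering.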

2605.31513 2026-06-01 cs.CV

Personalize Your Large Vision-language Models With In-context Prompt Tuning

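Yanshu Li, Jiaqian Li, Kuai Yu, Xi Xiao, Dongfang Liu, Tianyang Wang, Ruixiang Tang

AI summary: Proposes in-context prompt tuning (ICPT), in which a lightweight projection module extracts fine-grained visual semantics from multiple reference images and converts them into continuous prompts; geometric regularization counters environmental bias and cross-concept interference, enabling efficient personalization.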

Comments 27 pages, 10 figures, 5 tables

Abstract

Large vision-language models (LVLMs) have demonstrated strong general multimodal capability and are increasingly deployed in downstream systems. This trend has driven growing interest in LVLM personalization, which aims to enable models to quickly and effectively learn out-of-distribution multimodal concepts to meet user-specific needs. However, many existing methods rely on inference-time training, which reduces efficiency. They also struggle to maintain accuracy in complex multi-image, multi-concept settings. These limitations restrict the broader deployment of LVLM-based systems. Therefore, this paper proposes in-context prompt tuning (ICPT). Specifically, ICPT employs a lightweight projection module capable of operating in complex scenarios to extract fine-grained visual semantics from multiple reference images, seamlessly transforming these features alongside identity-label mappings into continuous prompts. To maximize computational efficiency, this module adaptively determines the prompt length based on the intrinsic visual complexity of each concept. Crucially, to overcome the environmental biases and cross-concept interference prevalent in real-world applications, we introduce two novel geometric regularizations. These constraints refine prompt representations by decoupling key identities from transient environmental states and separating concepts to avoid semantic confusion. Extensive experiments show that ICPT achieves state-of-the-art personalization accuracy across diverse tasks and LVLM backbones.

2605.31512 2026-06-01 cs.CL

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

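Danish Ali, Li Xiaojian, Sundas Iqbal, Farrukh Zaidi

AI summary: For multilingual orthopedic decision support in low-resource healthcare settings, proposes a reliability-oriented framework that combines the language-aware adapted encoder IndicBERT-HPA with a deterministic selective-verification layer, achieving the best performance on English, Hindi, and Punjabi clinical text classification.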

Abstract

Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminology, mixed scripts, incomplete evidence, label imbalance and language-dependent documentation patterns. This article presents a reliability-oriented framework for classifying free-text orthopedic notes in English, Hindi and Punjabi. We compare task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, zero-shot instruction-tuned large language models (LLMs) and a domain-adaptive encoder, IndicBERT-HPA. IndicBERT-HPA augments IndicBERT with language-aware orthopedic adapter heads to support clinically relevant multilingual representation learning. Evaluation extends beyond aggregate accuracy to per-class performance, ROC-AUC, AUPRC, expected calibration error, cross-language stability and robustness under controlled balanced and natural-prevalence distributions. The evaluated zero-shot LLMs remain substantially less effective than task-adapted encoders for closed-set classification, with language-dependent instability. Under natural clinical prevalence, IndicBERT-HPA achieves the strongest overall performance, reaching an averaged Macro-F1 of 0.8792, Macro-AUROC of 0.894 and AUPRC of 0.902. We further implement a deterministic selective-verification layer combining confidence gating, evidence-consistency checking and language-risk screening. On a randomly selected held-out 5,000-record subset, it achieves 84.4% selective accuracy and 0.76 selective Macro-F1 at 72.3% coverage, compared with 71.5% accuracy and 0.65 Macro-F1 for accept-all prediction. These results support reliability-oriented multilingual clinical decision support with explicit deferral.
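
The coverage and selective-accuracy figures quoted in the abstract follow the usual selective-prediction bookkeeping, which can be sketched as follows. This sketch covers confidence gating only; the evidence-consistency and language-risk checks are not modeled, and the threshold, class names, and data are hypothetical.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Accept a prediction when its confidence reaches the threshold; defer otherwise.

    Returns (coverage, selective_accuracy); accuracy is None if nothing is accepted.
    """
    accepted = [(p, y) for c, p, y in zip(confidences, predictions, labels) if c >= threshold]
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, None
    selective_accuracy = sum(p == y for p, y in accepted) / len(accepted)
    return coverage, selective_accuracy

# Hypothetical model outputs for five notes.
conf = [0.95, 0.90, 0.55, 0.80, 0.40]
pred = ["fracture", "sprain", "fracture", "sprain", "fracture"]
gold = ["fracture", "sprain", "sprain", "sprain", "sprain"]

accept_all = selective_metrics(conf, pred, gold, threshold=0.0)  # no deferral
gated = selective_metrics(conf, pred, gold, threshold=0.7)       # defer low-confidence cases
```

Deferring the two low-confidence notes lowers coverage but raises accuracy on the accepted ones, the same trade-off the abstract reports between accept-all prediction and the selective layer.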

2605.31509 2026-06-01 cs.LG cs.AI

Skill Reuse as Compression in Agentic RL

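Zhikun Xu, Yu Feng, Jacob Dineen, Taiwei Shi, Jieyu Zhao, Ben Zhou

AI summary: Proposes ReuseRL, which uses the minimum description length principle to compress successful trajectories into a dictionary of reusable skills and applies a segmentation cost to penalize inefficient encodings, improving in-distribution and out-of-distribution success rates across several environments.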

Comments Work in progress

2605.31508 2026-06-01 cs.CV

Internalizing Temporal Consistency in Video Object-Centric Learning without Explicit Regularization

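Rongzhen Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen

AI summary: Proposes a video object-centric learning method that needs no explicit temporal-consistency loss (SSC); Chrono-Channel Decomposition (CCD) and Cross-Temporal Reconstruction (CTR) let the model learn temporal consistency implicitly, improving training efficiency and performance.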

Comments 14 pages

Abstract

Video Object-Centric Learning (OCL) aims to represent objects as \textit{slot} vectors and maintain their consistency across frames. Slot-Slot Contrastive (SSC) loss has become the cornerstone for state-of-the-art (SOTA) video OCL methods. While highly effective, SSC relies on one-to-one object correspondence across frames and introduces an extra loss. Following Occam's Razor, we propose a paradigm shift: temporal consistency is better enforced as an implicit model design rather than an explicit loss. To elegantly exclude SSC (\textbf{xSSC}), we introduce two quasi-zero-overhead synergistic mechanisms: (\textit{i}) Chrono-Channel Decomposition (CCD) structurally disentangles slot representations along the channel dimension into \textit{static} and \textit{dynamic} sub-spaces, serving as an empirically unified information bottleneck; (\textit{ii}) Cross-Temporal Reconstruction (CTR) stochastically reconstructs target features of either the current or previous time step by fusing current slots' static channels and target slots' dynamic channels, using a single standard OCL decoder with minor training adaptation. Thereby, the slot sets inherently learn temporal consistency by minimizing the standard reconstruction error alone. Extensive experiments show that integrating xSSC into leading baselines not only improves training efficiency but also establishes new SOTAs on video object discovery and recognition tasks. Furthermore, our PCA and gradient analyses confirm that objects' time-invariant semantics and time-variant kinematics are encoded into the proposed sub-spaces. Our source code, model checkpoints and training logs are provided on https://github.com/Genera1Z/xSSC.

2605.31504 2026-06-01 cs.LG stat.ML

When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

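Dylan Steiner, Gustavo Arango-Argoty, Gerald Sun, Etai Jacob

AI summary: Proposes the DECAT framework, which uses five null-referenced metrics and a rule-based decision procedure to classify multimodal representations into four diagnostic scenarios, detecting whether a model has learned shared biology, single-modality biology, or spurious correlations.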

Abstract

Multimodal models in oncology can produce accurate predictions, but accurate prediction does not reveal whether the model has learned biology that is shared across modalities, biology confined to one modality, or spurious correlations that reflect confounders rather than genuine biology. We introduce DECAT, a model-agnostic post-hoc evaluation framework that classifies multimodal representations into four diagnostic scenarios for a given task and modality, using five null-referenced metrics and a rule-based decision procedure. The framework operates on learned representations, requires no knowledge of which specific confounder is present, and returns indeterminate when the evidence is insufficient. We validate DECAT on synthetic data across four multimodal model classes (over 2,500 trained representations) and on real data from 8,979 TCGA patients, evaluating both multimodal embeddings and five pretrained pathology foundation models. Entangled models (e.g., CLIP) achieve near-perfect shared biology detection but falsely claim shared biology in the majority of cases where it is absent on real foundation model embeddings. This false claim rate increases with confound strength so that larger cohorts and stronger representations produce more confident but still incorrect diagnoses. Applied to both multimodal TCGA embeddings and five pathology foundation models without paired RNA, DECAT detects confounding invisible to AUROC without requiring the confounder labels, as confirmed by post-hoc stratification.

2605.31503 2026-06-01 cs.CV cs.LG

How can embedding models bind concepts?

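Arnas Uselis, Darina Koishigarina, Seong Joon Oh

AI summary: This paper studies the limits of vision-language embedding models such as CLIP on concept binding. Scene embeddings decompose additively into object representations, but CLIP's high-complexity binding function hinders generalization, whereas transformers trained with sufficient data coverage learn low-complexity binding functions with multiplicative interactions and generalize systematically.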

Comments ICML 2026

Abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.

2605.31500 2026-06-01 cs.LG cs.AI

On Efficient Scaling of GNNs via IO-Aware Layers Implementations

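Daria Fomina, Daniil Krasylnikov, Alexey Boykov, Andrey Dolgovyazov, Vyacheslav Zhdanovskiy, Fedor Velikonivtsev

AI summary: To address the sparse, irregular memory-access bottleneck in GNNs, develops GPU kernels for three kernel families (SpMM-based convolutions, reduction-based aggregations, and attention layers) that reduce data movement and improve locality, with GATv2 reaching up to 8.5x speedup and up to 76x lower peak memory on realistic graphs.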

Comments International Conference on Machine Learning (ICML) 2026, Spotlight Paper

Abstract

Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric support general message passing, but complex layers often materialize edge-wise intermediates, increasing memory traffic and limiting scalability on large graphs. We take an I/O- and arithmetic-intensity--centric view and show that widely used layers fall into three kernel families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2/Graph Transformer). For each family, we develop GPU kernels that reduce data movement, improve locality, and remain robust across realistic graphs. We also study graph reordering and find that its impact depends on the kernel mapping: it benefits neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel designs. Empirically, our fused attention kernels reach up to $\textbf{3.9}\times$ speedup for Graph Transformer (median $\textbf{1.6}\times$), with Tensor Core (block-sparse) variants up to $\textbf{7.3}\times$ on locally dense graphs; for GATv2 we reach up to $\textbf{8.5}\times$ speedup (median $\textbf{2.0}\times$) while reducing peak memory by up to $\textbf{76}\times$ (median $\textbf{6}\times$). Our degree-aware reduction kernels achieve up to $\textbf{10}\times$ speedup (median $\textbf{2.6}\times$). For SpMM-based layers, properly cached cuSPARSE achieves up to $\textbf{8}\times$ speedup over DGL and outperforms evaluated custom baselines in the majority of evaluations. We release our implementations as drop-in replacements to support reproducible, hardware-aware GNN acceleration.

2605.31497 2026-06-01 cs.LG stat.ML

Assign and Add: A Mechanistic Study of Compositional Arithmetic

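Brady Exoo, Alberto Bietti, John Sous

AI summary: Using a task that combines variable assignment with modular addition, studies the mechanism behind compositional generalization in transformers; the model uses the same modular-addition module for direct and indirect inputs, and training shows three phases of learning.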

Abstract

Large language models are able to compose skills in order to perform complex tasks, many of which might not have been seen during training. The details of how exactly this composition occurs remain elusive. In this paper, we study a mechanism for compositional generalization in transformers by considering a simple controlled setting involving variable assignment and modular addition. By partitioning our training data into disjoint sets, we observe that small transformers are able to generalize to previously unseen combinations of variables and numbers. Our mechanistic analysis shows that the same "modular addition" MLP module is used whether the inputs are given directly or indirectly through a separate variable assignment mechanism. We also analyze the training dynamics from an empirical lens, which reveals three phases of learning: first, modular addition is learned, then the structure required for variable assignment, and finally a refinement phase where the model generalizes to some hard sequences not seen in training. Finally, we provide a theoretical framework to explain how compositionality emerges from training dynamics. These results suggest that compositional generalization can be a natural consequence of the compositionality of internal mechanisms in transformers.
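
To make the task setting concrete, the sketch below shows the two ways operands can reach a modular addition: directly as numbers, or indirectly through variable assignments. It illustrates the task only, not the transformer mechanism the paper analyzes, and the encoding (a dictionary of assignments, a two-operand query, modulus 7) is a hypothetical choice rather than the paper's data format.

```python
def solve(assignments, query, modulus):
    """Resolve each operand (variable name or numeric literal) and add modulo `modulus`."""
    def value(token):
        # A token is either a previously assigned variable or a literal number.
        return assignments[token] if token in assignments else int(token)

    left, right = query
    return (value(left) + value(right)) % modulus

assignments = {"a": 3, "b": 6}                         # hypothetical: a = 3, b = 6
direct = solve(assignments, ("3", "6"), modulus=7)     # operands given directly
indirect = solve(assignments, ("a", "b"), modulus=7)   # operands given via variables
```

Both queries reduce to the same addition once the variables are resolved, which is the sense in which one modular-addition component can serve direct and indirect inputs.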

2605.31494 2026-06-01 cs.CL cs.LG

Consolidating Rewarded Perturbations for LLM Post-Training

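Zheyu Zhang, Shuo Yang, Gjergji Kasneci

AI summary: Proposes CoRP, which consolidates rewarded perturbations into a single model through reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, without gradients, improving the base model by 8.1 points on average at one forward pass per test example.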

Abstract

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by RandOpt, relocates this loop to weight space, sampling Gaussian perturbations around a pretrained model and ensembling the top-K rewarded specialists at inference. While competitive with PPO and GRPO under matched training compute, this prediction-level ensemble incurs K forward passes per test example and does not extend cleanly to free-form generation. We ask whether the rewarded population can instead be folded into a single deployable model, replacing the inference-time ensemble with one consolidated update. A split-half analysis over 25 model-task pairs reveals reproducible low-rank structure in every case. We turn this geometry into CoRP (Consolidating Rewarded Perturbations), a gradient-free operator that combines reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, with no gradient flowing through the language model. Across five language models from 0.5B to 8B and five tasks covering math, code, and creative writing, CoRP improves the base model by 8.1 points on average. Using one tenth of RandOpt's perturbation budget, CoRP exceeds single-inference RandOpt by 6.5 points and recovers more than half of the gain of the 50-pass majority-vote ensemble, at one forward pass per test example.
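
The first ingredient named in the abstract, reward-weighted aggregation, can be illustrated with a small sketch. The abstract does not state the weighting rule, so the softmax over rewards used here is an assumption, as are the function name, the temperature parameter, and the toy vectors; compatibility-aware reweighting and the validation gate are not modeled.

```python
import math

def reward_weighted_update(perturbations, rewards, temperature=1.0):
    """Softmax-weight each perturbation by its reward (assumed rule) and return the weighted sum."""
    top = max(rewards)  # subtract the max for numerical stability
    scores = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(scores)
    weights = [s / total for s in scores]
    dim = len(perturbations[0])
    update = [sum(w * p[j] for w, p in zip(weights, perturbations)) for j in range(dim)]
    return update, weights

# Toy weight-space perturbations around a two-parameter "model".
perturbations = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
rewards = [2.0, 2.0, 0.0]
update, weights = reward_weighted_update(perturbations, rewards)

base = [0.5, 0.5]
consolidated = [b + u for b, u in zip(base, update)]  # single model, one forward pass at inference
```

The point of the sketch is the shape of the operation: many scored perturbations collapse into one update to the base weights, so no ensemble is needed at inference time.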

2605.31492 2026-06-01 cs.AI

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

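Liwei Kang, Yee Whye Teh, Wee Sun Lee

AI summary: LLM reasoning traces represent the search tree only implicitly, which hurts performance; LinTree adds parent pointers to make the linearized tree structure explicit, improving task performance and search efficiency on Blocks World, grid navigation, and Sokoban.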

Comments 16 pages, 3 figures

Abstract

Large language models (LLMs) often solve reasoning problems by generating intermediate traces that explore and revise partial solutions. From a search perspective, these traces can be viewed as linearized search trees, where the model extends a partial solution, abandons it when it fails, and backtracks to try alternatives. Compared with traditional heuristic-guided search, such a policy has a potential advantage: it conditions on the whole search trace rather than only on the current local state. We first test whether LLMs utilize this advantage by comparing trace-conditioned reasoning policies against best-first search equipped with an LLM heuristic that only observes the current local state. Across three controlled reasoning environments, Blocks World, grid Navigation, and Sokoban, we find that raw access to search history alone is not enough to reliably outperform heuristic search. We then study one possible reason: in LLM reasoning traces, the underlying search tree is only implicitly represented, and when the model backtracks or switches branches, the trace does not explicitly identify which earlier search state is being revisited. We show that adding simple parent pointers to explicitly represent the linearized tree (LinTree) structure improves both task performance and search efficiency relative to implicit reasoning models and LLM-heuristic-guided search. These results suggest that search history becomes most useful when its tree structure is made explicit, motivating more structure-aware representations for LLM reasoning.
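
The idea of a linearized search tree with parent pointers can be sketched as a flat list of steps in which each step names the earlier step it extends. The record layout below (`id`, `parent`, `state`) is a hypothetical encoding for illustration, not the trace format used in the paper.

```python
def path_to_root(trace, node_id):
    """Follow parent pointers from a node back to the root and return the states in order."""
    by_id = {step["id"]: step for step in trace}
    path = []
    current = node_id
    while current is not None:
        path.append(by_id[current]["state"])
        current = by_id[current]["parent"]
    return list(reversed(path))

# A linearized search trace: steps appear in the order they were generated.
trace = [
    {"id": 0, "parent": None, "state": "start"},
    {"id": 1, "parent": 0, "state": "move left"},
    {"id": 2, "parent": 1, "state": "dead end"},
    {"id": 3, "parent": 0, "state": "move right"},  # backtrack: extends node 0, not node 2
    {"id": 4, "parent": 3, "state": "goal"},
]
solution = path_to_root(trace, 4)
```

Without the `parent` field, step 3 would simply follow step 2 in the text and the reader would have to infer which earlier state is being revisited; the pointer removes that ambiguity.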

2605.31486 2026-06-01 cs.RO

Learning Controlled Separation of Small Objects Between Two Fingers with a Tactile Skin

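Ulf Kasolowsky, Berthold Bäuml

AI summary: Introduces and solves the task of controlled separation of small objects between two fingers of a multi-purpose robotic hand, training a purely tactile policy with reinforcement learning and analyzing the benefit of spatially resolved tactile feedback.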

Abstract

We introduce and solve the novel task of controlled separation of small objects with two fingers of a multi-purpose robotic hand: after grasping into a box of small objects, the task is to drop as many of them until a desired number remains between the fingers. The objects are small compared to the width of the fingers but also in absolute terms. In our case little pellets with a diameter of only 6mm are handled. We show that the task can be performed purely tactile (no vision) using a spatially-resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a straightforward sparse reward, which basically checks if the desired number of objects is reached. In simulation experiments, we provide an exhaustive analysis of the benefits of using spatially-resolved tactile feedback: while an ideal (high-resolution) tactile sensor allows solving the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still leads to an improvement of up to 20% compared to using only the fingers' joint sensors. For this analysis, we further train an estimator alongside the policy that predicts the ground truth contact positions. Finally, we demonstrate the successful sim-to-real transfer for the DLR-Hand II equipped with a tactile skin.

2605.31485 2026-06-01 cs.LG math.CT

Graphical einops: bridging tensor networks and computation graphs

Vincent Wang-Maścianica, Nikhil Khatri

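AI summary: This paper proposes a formal graphical calculus for a structural fragment of einops tensor programming, enabling diagrammatic proofs of tensor equivariance through rank-naturality rewrites, and applies it to attention-mask transformations for optimizing sparse attention implementations.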

2605.31484 2026-06-01 cs.LG

Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence

Valérie Castin, Kimia Nadjahi, Pierre Ablin, Gabriel Peyré

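AI summary: Because LoRA is overparameterized, different low-rank factor pairs can have very different condition numbers, which affects convergence speed. BaLoRA addresses this by projecting onto a balanced manifold to improve the conditioning of the loss landscape, giving faster convergence and better performance.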

Comments Accepted at ICML 2026

2605.31481 2026-06-01 cs.RO

Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning

Yue Wang, Yanran Xu, Wenbo Wu, Chuanhang Qiu, Zhaoxing Li

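AI summary: Proposes BARD, a PyTorch-based batched differentiable rigid-body dynamics library that uses a three-tier cache, matmul-free joint transforms, and level-parallel propagation to reach up to 64x faster forward kinematics on GPU while supporting gradient computation.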

Abstract

As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: https://github.com/YueWang996/bard-pytorch-dynamics.
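
As a rough illustration of what "pre-computed Rodrigues constants" could mean, the sketch below uses the standard Rodrigues formula R = I + sin(θ) K + (1 − cos(θ)) K², where K is the skew-symmetric matrix of a unit joint axis. This is an assumption about the general idea, written in plain Python rather than PyTorch, and it is not BARD's actual code: K and K² are computed once per joint, so each per-sample rotation needs only elementwise scaling and addition.

```python
import math


def skew(axis):
    """Skew-symmetric cross-product matrix K of a unit axis."""
    x, y, z = axis
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul3(a, b):
    """3x3 matrix product, used only during precomputation."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def precompute(axis):
    """One-time work per joint: the constants K and K @ K."""
    k = skew(axis)
    return k, matmul3(k, k)


def rotation(constants, theta):
    """R = I + sin(theta) K + (1 - cos(theta)) K^2, elementwise only."""
    k, k2 = constants
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * k[i][j] + c * k2[i][j]
             for j in range(3)] for i in range(3)]


# A revolute joint about the z axis, evaluated for a small batch of angles.
z_joint = precompute((0.0, 0.0, 1.0))
batch = [rotation(z_joint, theta) for theta in (0.0, math.pi / 2)]
```

In a batched tensor setting the same expression would broadcast over the batch dimension, which is presumably where the GPU benefit comes from.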

2605.31480 2026-06-01 cs.CL

Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task

Bart Evelo, Meaghan Fowlie, Denis Paperno

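AI summary: Using the Personal Relation Task, the paper compares humans and large language models on the extensional task (identifying the referent) and the intensional task (representing meaning in a structured way). Humans do better on the extensional task and LLMs on the intensional task, suggesting that the lack of referential grounding is a key missing ingredient for LLMs to mimic human language understanding.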

Comments A pre-MIT Press publication version. Paper accepted to Transactions of the Association for Computational Linguistics

Abstract

Do neural models, such as Large Language Models, genuinely acquire compositional abilities for interpretation of natural language? When we talk about semantic interpretation, we can distinguish two complementary aspects: establishing what an expression refers to in the world (which we call the Extensional task) and representing its sense in a structured way (which we call the Intensional task). We evaluate LLMs and humans on both tasks in the setting of the Personal Relation Task (Paperno 2022) in which, given a universe of people and their relationships with each other, one is asked to interpret a noun phrase such as "Amber's parent's friend". Here, for the Intensional task, the answer is the formula "friend(parent(amber))", and for the Extensional task, the person. We find that humans and LLMs show opposite strengths: humans perform better on Extensional than Intensional tasks, and LLMs vice versa. Our methodology brings greater nuance to the understanding of compositional abilities in modern machine learning models. Our results support the notion that the lack of referential grounding in LLM training is a crucial missing component in mimicking human-like language understanding.
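
The difference between the two tasks can be made concrete with a toy sketch. The universe of people and the one-person-per-relation simplification below are invented for illustration and are not from the paper: the intensional answer is a formula built from the phrase alone, while the extensional answer requires looking the formula up in the universe.

```python
import re

# Hypothetical toy universe: each relation maps a person to exactly one person.
UNIVERSE = {
    "parent": {"amber": "bob", "bob": "carol"},
    "friend": {"bob": "dana", "amber": "erin"},
}


def intensional(phrase: str) -> str:
    """Map "Amber's parent's friend" to the formula friend(parent(amber))."""
    head, *relations = phrase.split("'s ")
    formula = head.lower()
    for relation in relations:
        formula = f"{relation}({formula})"
    return formula


def extensional(formula: str) -> str:
    """Evaluate a formula such as friend(parent(amber)) to a person."""
    match = re.fullmatch(r"(\w+)\((.+)\)", formula)
    if match is None:
        return formula  # a bare name denotes itself
    relation, argument = match.groups()
    return UNIVERSE[relation][extensional(argument)]


formula = intensional("Amber's parent's friend")
print(formula)               # friend(parent(amber))
print(extensional(formula))  # dana
```

Note that `intensional` never touches `UNIVERSE`, whereas `extensional` cannot be answered without it.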

2605.31476 2026-06-01 cs.RO

IDOL: Inverse-Dynamics-Guided Future Prediction for End-to-End Autonomous Driving

Chenghao Zhang, Timin Li, Dongmei Li

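AI summary: Proposes the IDOL framework, which uses an inverse dynamics model to turn the future latent scene states predicted by a BEV world model into planning-relevant trajectory deltas, tightly coupling future prediction with trajectory optimization and achieving state-of-the-art performance on the NAVSIM benchmarks.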

Comments 20 pages, 5 figures

Abstract

End-to-end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world-model-based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning-relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose IDOL, an inverse-dynamics-guided future prediction framework for world-model-based end-to-end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition-aware trajectory features and recover planning-relevant motion deltas that explain how the latent world evolves over time. These inverse-dynamics-derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed-loop refinement module further improves long-horizon consistency by reusing the optimized trajectory for another round of future-aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state-of-the-art performance among comparable methods.

2605.31469 2026-06-01 cs.CL cs.AI cs.SD eess.AS

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády

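AI summary: To address the shortage of training data for Hungarian conversational speech recognition, the paper expands the BEA-Dialogue corpus to 200 hours by relaxing the split criterion, evaluates Whisper- and FastConformer-based models, and shows that fine-tuning with Serialized Output Training consistently improves recognition performance.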

Abstract

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.

2605.31468 2026-06-01 cs.AI

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan, Guozheng Tang, Jiale Chen, Xinzhe Wu, Mingtian Yang, Chenyang Di, Jiajun Li, Lingching Tung, Peichao Lai, Yifei Xia, Ziyi Guo, Yanwei Xu, Yanzhao Qin, Shaoduo Gan, Xupeng Miao, Bin Cui

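AI summary: Proposes AutoSci, a memory-centric agentic system covering the full scientific research lifecycle, which automates research through structured memory, a multi-stage workflow, DAG-based augmentation, and an evolution mechanism.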

Abstract

Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema-governed research memory, separating Long-Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project-level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five-stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG-shaped multi-agent operators and reusable stage-specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.

2605.31466 2026-06-01 cs.CV

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

Tuan Duc Ngo, Chuang Gan, Evangelos Kalogerakis

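AI summary: Proposes the VolFill framework, which uses a hybrid 3D VAE and a latent Diffusion Transformer to generate complete 3D scene structure from a single RGB image, significantly outperforming existing methods on the SCRREAM and NRGB-D datasets.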

Abstract

Reconstructing the complete geometry of a scene from a single RGB image remains challenging - especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

2605.31464 2026-06-01 cs.LG cs.AI

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

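AI summary: Studies language models as selective surrogates for GPU kernel performance, using reinforcement learning to improve forecast accuracy and calibration and speeding up kernel search under a limited GPU evaluation budget.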

Comments Code: https://github.com/codezakh/gpu-forecasters

Abstract

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches scale to large search budgets, on-device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective: it should know when it could be wrong and defer to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU-measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance and that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, which leads to finding faster kernels than an equal-budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.
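
One way to picture a selective surrogate inside a search loop is sketched below. The function names, the `(predicted runtime, confidence)` forecast shape, and the skip rule are all assumptions made for illustration, not the paper's method: a candidate is discarded without touching the GPU only when the forecaster is confident that it is slower than the best measured kernel so far, and every other candidate is deferred to a real measurement while budget remains.

```python
from typing import Callable, List, Optional, Tuple


def selective_search(
    candidates: List[str],
    forecast: Callable[[str], Tuple[float, float]],  # (predicted runtime, confidence)
    measure_on_gpu: Callable[[str], float],
    threshold: float,
    gpu_budget: int,
) -> Tuple[Optional[str], float, int]:
    """Return (best kernel, its measured runtime, GPU measurements used)."""
    best, best_runtime, used = None, float("inf"), 0
    for kernel in candidates:
        predicted, confidence = forecast(kernel)
        if confidence >= threshold and predicted >= best_runtime:
            continue  # confidently predicted slower than the incumbent: skip
        if used == gpu_budget:
            break  # no budget left to verify uncertain or promising kernels
        runtime = measure_on_gpu(kernel)  # defer to the GPU
        used += 1
        if runtime < best_runtime:
            best, best_runtime = kernel, runtime
    return best, best_runtime, used


# Toy stand-ins for a forecaster and a GPU benchmark (hypothetical numbers).
TRUE_RUNTIME = {"k0": 5.0, "k1": 9.0, "k2": 3.0, "k3": 8.0}
CONFIDENCE = {"k0": 0.9, "k1": 0.9, "k2": 0.4, "k3": 0.95}

result = selective_search(
    candidates=["k0", "k1", "k2", "k3"],
    forecast=lambda k: (TRUE_RUNTIME[k], CONFIDENCE[k]),
    measure_on_gpu=lambda k: TRUE_RUNTIME[k],
    threshold=0.8,
    gpu_budget=2,
)
print(result)  # ('k2', 3.0, 2)
```

Here four candidates are considered with only two GPU measurements, which is the kind of budget stretching the abstract describes.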

2605.31463 2026-06-01 cs.LG cs.AI cs.CL cs.DC

PithTrain: A Compact and Agent-Native MoE Training System

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen

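AI summary: Proposes PithTrain, a compact MoE training framework built on agent-native design principles, and introduces ATE-Bench to evaluate agent-task efficiency. PithTrain matches the throughput of production frameworks while reducing agent turns by up to 62% and active GPU time by up to 64%.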

Abstract

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. The rise of AI coding agents could automate parts of training-framework development and accelerate this evolution. But applying these agents to existing frameworks carries hidden costs that are invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows that PithTrain matches the throughput of production frameworks and, on ATE-Bench, enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.

2605.31460 2026-06-01 cs.RO cs.SY eess.SY

On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making

Joonhee Lee, Hyunseung Shin, Hyunmi Kim, Pei Zhang, Jeonggil Ko

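AI summary: Proposes the REIS framework, which reduces inference redundancy through scene gating, KV-guided affordance routing, and deliberate reasoning, speeding up robot control while preserving semantic adaptability.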

Comments 19 pages

2605.31457 2026-06-01 cs.CV

VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning

Hengbo Xu, Shengjie Jin, Yanbiao Ma, Zhiwu Lu

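AI summary: Proposes the VisionPulse framework for step-wise visual token pruning, which uses visual attention mass to estimate a retention budget and keeps only the critical tokens, reducing inference overhead and reasoning-trace length with almost no loss in accuracy.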

Comments Accepted at ICML 2026

Abstract

With the rapid advancement of large multimodal models (LMMs), inference-time overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming the required visual evidence remains static during reasoning. However, we empirically show that visual evidence is strongly step-dependent: only a sparse subset of visual tokens is critical at each decoding step, and the critical set evolves across reasoning. Furthermore, we identify a coupled bottleneck where redundant visual context can steer the model toward query-irrelevant regions, lengthening the reasoning trace. Guided by these insights, we propose VisionPulse, a step-wise visual token pruning framework that operates during reasoning. VisionPulse computes a lightweight visual attention mass to estimate the step-wise retention budget, exploiting its strong positive correlation with LMMs' effective visual token usage, and retains only the most critical tokens under this budget. By enforcing visual sparsity during reasoning, VisionPulse filters redundant visual context while preserving relevant visual evidence, which naturally shortens reasoning traces. Extensive experiments show that VisionPulse retains only 5% of visual tokens per step and shortens reasoning traces by 11.2%, while keeping accuracy almost unchanged.
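
A minimal sketch of budgeted, step-wise pruning follows. The linear mapping from attention mass to budget and the `max_keep_ratio` parameter are assumptions chosen for illustration; the abstract states only that the budget is estimated from the visual attention mass, not how.

```python
import math
from typing import List


def prune_visual_tokens(
    attn: List[float],       # this step's attention over all context tokens
    visual_idx: List[int],   # positions of the visual tokens in the context
    max_keep_ratio: float,   # hypothetical cap on the kept fraction
) -> List[int]:
    """Return the visual token positions kept for the current decoding step."""
    # Share of this step's attention that lands on visual tokens, in [0, 1].
    mass = sum(attn[i] for i in visual_idx)
    # Assumed mapping: more visual attention mass means a larger budget.
    budget = max(1, math.ceil(mass * max_keep_ratio * len(visual_idx)))
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    return sorted(ranked[:budget])


# Ten context tokens, of which the first eight are visual.
attn = [0.05, 0.30, 0.02, 0.01, 0.10, 0.01, 0.005, 0.005, 0.3, 0.2]
kept = prune_visual_tokens(attn, visual_idx=list(range(8)), max_keep_ratio=0.4)
print(kept)  # [1, 4]
```

Because the function is called once per decoding step with that step's attention, the kept set can change as reasoning proceeds, unlike pruning once at prefill.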

2605.31455 2026-06-01 cs.LG cs.CL

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu

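AI summary: Online reinforcement learning is costly in multi-turn interaction, while offline supervised fine-tuning suffers from distribution shift. The DRIFT framework recasts the KL-regularized reinforcement learning objective as an equivalent importance-weighted supervised learning problem, enabling efficient and stable multi-turn optimization.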

Abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
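
The equivalence mentioned above rests on a standard result: the maximizer of the KL-regularized objective E[R] − β·KL(π ‖ π_ref) satisfies π*(y|x) ∝ π_ref(y|x)·exp(R(x, y)/β), so fitting π to π* by maximum likelihood on samples from π_ref amounts to SFT with weights proportional to exp(R/β). The sketch below shows that generic weighting; the function names are hypothetical, and DRIFT's exact weight definition and normalization may differ.

```python
import math
from typing import List


def importance_weights(returns: List[float], beta: float) -> List[float]:
    """Self-normalized weights proportional to exp(return / beta)."""
    top = max(returns)  # subtract the max for numerical stability
    unnormalized = [math.exp((r - top) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_sft_loss(log_probs: List[float], weights: List[float]) -> float:
    """Weighted negative log-likelihood over offline trajectories."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Three offline trajectories sampled from the reference policy (toy numbers).
returns = [1.0, 0.0, 2.0]
log_probs = [-3.0, -2.5, -4.0]  # log-likelihood of each under the current policy
weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

A smaller `beta` concentrates the weight on the highest-return trajectories, while a very large `beta` approaches uniform weights, that is, plain SFT on the reference data.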

2605.31452 2026-06-01 cs.CL cs.HC

Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows

Yuri Balashov, Rex VanHorn, Mingxi Xu, Austin Downes

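AI summary: For freelance translators and small language service providers, the paper develops practical, low-barrier methods that benchmark locally runnable large language models on offline translation for confidentiality-sensitive domains. It finds that carefully selected local LLMs can match or surpass local neural machine translation systems and a frontier LLM but lag behind top commercial neural machine translation systems.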

Comments 20 pages. Accepted at EAMT-2026 (Tilburg, Netherlands, June 2026)

Abstract

Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods. Here we address a high-stakes, specialized need: offline translation for confidentiality-sensitive domains in which privacy constraints preclude the use of cloud-based engines and commercial LLMs. We expand the Reeve Foundation Trilingual Corpus (RFTC) used in our previous work into a multilingual corpus (RFMC) by adding sentence-aligned German and Simplified Chinese reference translations. We then benchmark several locally runnable language models (via Ollama) across four language directions on 1000+ sentences selected from this corpus. We use consistent single-prompt calls without fine-tuning or domain adaptation, comparing local LLM outputs against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional-grade local NMT systems (OPUS-CAT, NeuralDesktop, Promt). Automatic evaluation is conducted with MATEO. Results reveal substantial variation in local LLM performance across language directions and model sizes. The best local LLMs match or surpass local NMT systems and a frontier LLM, though they remain behind top commercial NMTs. These findings underscore the viability of carefully selected local LLM translation for privacy-constrained professionals and inform future research on model scaling and multilingual capability.

2605.31446 2026-06-01 cs.CL cs.AI

Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction

Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin

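AI summary: Proposes the FiVeD framework, which performs fine-grained verification with diagnostic reasoning supervision, using objectives such as quality scoring and error-type classification to improve the reliability of ASTE triplet extraction.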

Comments 25 pages, 13 figures, and 6 tables

Abstract

Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion mining, explainable recommendations, and review summarization. Prior work mainly focuses on end-to-end extraction, while post hoc verification of extracted triplets remains comparatively underexplored. This gap limits the reliability of ASTE systems, since predicted triplets may be locally plausible while being globally invalid. Moreover, candidate invalidity is multi-faceted and candidate usability is inherently graded, motivating a fine-grained verification mechanism that can filter or re-rank outputs from diverse extractors. In this paper, we propose FiVeD, a framework for Fine-grained Verification with Diagnostic reasoning supervision. Specifically, the verifier is trained with multiple complementary objectives, including validity classification and quality score estimation as primary tasks, with error type classification and rationale generation as auxiliary tasks. We define hierarchical error categories and construct plausible incorrect triplets under semantic and syntactic constraints, and leverage an off-the-shelf LLM with task-specific rubrics to produce quality scores and diagnostic rationales. During inference, the resulting quality scores are used to filter candidate outputs, supporting adjustable precision-recall tradeoffs. Experiments across multiple ASTE baselines demonstrate that FiVeD consistently improves extraction performance by up to 3.53 F1 points as a plug-and-play verification module.
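
The adjustable precision-recall tradeoff at inference can be sketched as simple threshold filtering. The triplets, quality scores, and helper names below are invented for illustration and do not come from the paper; in FiVeD the scores would be produced by the trained verifier.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (aspect term, opinion term, sentiment polarity)


def filter_by_quality(scored: List[Tuple[Triplet, float]], threshold: float) -> List[Triplet]:
    """Keep candidate triplets whose verifier quality score meets the threshold."""
    return [triplet for triplet, score in scored if score >= threshold]


def precision_recall(predicted: List[Triplet], gold: List[Triplet]) -> Tuple[float, float]:
    """Exact-match precision and recall against the gold triplets."""
    hits = len(set(predicted) & set(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical extractor output with verifier scores for one review sentence.
scored = [
    (("battery", "long-lasting", "positive"), 0.9),
    (("screen", "dim", "negative"), 0.6),
    (("screen", "long-lasting", "positive"), 0.2),  # plausible-looking but wrong
]
gold = [("battery", "long-lasting", "positive"), ("screen", "dim", "negative")]

for threshold in (0.0, 0.5, 0.8):
    kept = filter_by_quality(scored, threshold)
    print(threshold, precision_recall(kept, gold))
```

Raising the threshold from 0.0 to 0.5 removes the wrong triplet and lifts precision without hurting recall, while raising it further to 0.8 also discards a correct triplet and lowers recall.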
