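arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.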

2606.03495 2026-06-03 cs.LG

HiSE: A Lightweight Hierarchical Semantic Explainer for Heterogeneous Graph Neural Networks

Zongrui Li, Yuhang Zhao, Ying Zhao, Yuanzhao Guo, Qiang Huang, Yuan Tian

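Affiliations: School of Artificial Intelligence, Jilin University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes HiSE, a lightweight, feature-oriented explainer for heterogeneous graph neural networks. It uses hierarchical semantic modeling (sparse feature learning with LASSO at the semantic level, and adaptive fusion via KL divergence at the cross-semantic level) to produce high-fidelity explanations at low computational cost.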

Abstract

Heterogeneous graph neural networks (HGNNs) have demonstrated remarkable performance in modeling complex relational data; however, their interpretability in high-stakes applications remains a critical challenge. Existing explanation methods suffer from two major limitations: on the one hand, the generated explanations fail to reflect the inherent semantic hierarchy of HGNNs, resulting in a lack of fidelity to the model's internal decision-making mechanism; on the other hand, feature explanations often rely on complex search or perturbation mechanisms, leading to excessive computational complexity and poor efficiency. To address these issues, we propose HiSE, a lightweight feature-oriented interpretable model for HGNNs. HiSE achieves semantically aware feature explanations through hierarchical semantic modeling: at the semantic level, local surrogate models based on the Least Absolute Shrinkage and Selection Operator (LASSO) are employed to learn sparse feature representations under each semantic view; at the cross-semantic level, the contributions of different semantic views are adaptively characterized via KL divergence to produce a unified explanation. Extensive experiments demonstrate that HiSE outperforms existing methods in terms of fidelity, robustness, and cross-semantic explanation capability, while its lightweight framework incurs low computational overhead, enabling efficient application to large-scale, complex real-world heterogeneous graphs.
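As a rough illustration of the LASSO surrogate idea mentioned in the abstract, the sketch below fits a sparse linear model by cyclic coordinate descent. It is only a generic LASSO solver: how HiSE samples local data per semantic view, and how the KL-based fusion works, is not described in the abstract, so the function names, the `alpha` penalty, and the toy data are all assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of targets. Hypothetical helper, not HiSE code.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # a constant-zero feature can never be selected
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Toy data: three orthogonal features, only the first one drives the target.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2, 2, -2, -2]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
```

In an explainer, the nonzero entries of `weights` would be read as the features the surrogate considers relevant; the L1 penalty is what drives the irrelevant ones to exactly zero.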

2606.03493 2026-06-03 cs.CV cs.LG

Low-Frequency Shortcuts in Texture-Driven Visual Learning

Utku Şirin, Cathy Hou, David Alvarez-Melis, Stratos Idreos

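Affiliations: Harvard University; Kempner Institute

AI summary: Analyzes how neural networks in texture-driven domains rely on low-frequency components as shortcuts, and proposes pruning those components to remove the shortcut, improving in-distribution accuracy and robustness.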

Abstract

Neural networks suffer from shortcut learning, where learned features generalize well to the training set but not to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however, are texture-driven. In this work, we present shortcut learning analysis for texture-driven domains, and compare it with that of a standard benchmark. We show that texture-driven domains suffer from low-frequency shortcuts. They make the majority of their decisions based on a few low-frequency components (LFCs) with a skewed spectral behavior, even though their classification information lies in higher-frequency, fine-grained details. Pruning LFCs from training and test sets eliminates the shortcut and provides a more balanced spectral behavior, improving the ID accuracy by up to 8%. We show that low-frequency shortcuts make the models highly vulnerable to OOD corruptions, leading to an accuracy drop of up to 70% compared to the ID accuracy. Pruning LFCs significantly improves robustness to low-frequency corruptions, by up to 40%, and introduces a trade-off for high-frequency corruptions; the balanced spectral behavior provides a better generalization performance, whereas the increased dependence on high-frequency features reduces it. OOD accuracy depends on the interaction between these two factors.
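To make the idea of pruning low-frequency components concrete, here is a minimal one-dimensional sketch using a plain discrete Fourier transform. The paper works with images and the abstract does not specify its pruning procedure, so the 1D setting, the `cutoff` parameter, and the function names are assumptions for illustration only.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def prune_low_frequencies(signal, cutoff):
    """Zero every component whose frequency index is below `cutoff` (DC included)."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        frequency = min(k, n - k)  # index k and n - k carry the same frequency
        if frequency < cutoff:
            spectrum[k] = 0
    return [value.real for value in idft(spectrum)]


# A constant offset (lowest frequency) plus a fast alternating pattern.
signal = [6, 4, 6, 4, 6, 4, 6, 4]
pruned = prune_low_frequencies(signal, cutoff=1)
```

With `cutoff=1` only the constant offset is removed, so the fine-grained alternation survives. For images the same masking would be applied to a 2D spectrum.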

2606.03490 2026-06-03 cs.CV

TrAction: Action Recognition with Sparse Trajectories

Jan F. Meier, Felix B. Mueller, Alexander Ecker, Timo Lüddecke

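Affiliations: Institute of Computer Science and Campus Institute Data Science, University Göttingen; Max Planck Institute for Dynamics and Self-Organization

AI summary: Proposes sparse point trajectories as an input modality, combined with a Transformer architecture and masked-trajectory pretraining, to recognize actions efficiently at reduced compute cost, and shows that trajectory features complement appearance features.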

Abstract

Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction

2606.03483 2026-06-03 cs.LG cs.AI

Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

Ekaterina Alimaskina, Gleb Molodtsov, Aleksandr Beznosikov

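Affiliations: MIRAI; BRAIn Lab; Yandex Research; Innopolis University

AI summary: Using fine-grained diagnostics, finds that the multi-stream residual connections in Hyper-Connections suffer from stream collapse, where signal concentrates in a dominant stream, and mitigates it by breaking symmetry at initialization to improve performance.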

Abstract

Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream indices. We study how this symmetry is resolved in practice: whether streams specialize in a balanced way or exhibit dominant-stream usage. Using fine-grained diagnostics for HC-based language models, we trace how multi-stream representations are actually used. We find that after an early seeding stage, residual mixing often remains close to identity, limiting a core HC mechanism for exchanging information between streams. Moreover, both signal and interpretable features concentrate in a dominant stream, and the nominally multi-stream residual connection can underutilize its capacity, behaving closer to a single-stream residual pathway. Finally, we show that breaking symmetry at stream initialization reduces dominant behavior and improves performance across mHC variants. Our code is publicly available.

2606.03479 2026-06-03 cs.CV cs.GR

PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting

Adrian Ramlal, John S. Zelek

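Affiliations: University of Waterloo

AI summary: Proposes PersistGS, which couples differentiable rigid body simulation with 3D Gaussian Splatting to predict an object's SE(3) trajectory from physics while it is occluded, restoring object permanence, and introduces a centroid silhouette loss that lowers trajectory error.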

Comments Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696

Abstract

Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness. We propose PersistGS, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by +2.46dB PSNR and comes within 0.19dB of a ground-truth trajectory upper bound.

2606.03476 2026-06-03 cs.RO

Human2Humanoid: Physics-Aware Cross-Morphology Motion Retargeting for Humanoid Robots

Tianchen Huang, Feiyang Yuan, Junchi Gu, Shurui Fang, Xiaohu Zhang, Yu Wang, Wei Gao, Shiwu Zhang

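Affiliations: Institute of Humanoid Robots, Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China

AI summary: Proposes Human2Humanoid, an unsupervised motion retargeting framework that handles unpaired data with a CycleGAN and a skeleton-aware graph convolutional network, and uses a morphology-invariant end-effector consistency loss plus physics-aware feasibility constraints to retarget human motion to humanoid robots with high fidelity.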

Comments Project page: https://huangtc233.github.io/human2humanoid_website/

Abstract

Retargeting human motion to humanoid robots is critical for teleoperation, imitation learning and human-robot interaction. However, it remains challenging because of substantial morphological discrepancies between humans and robots, including differences in skeletal topology, limb proportions and degrees of freedom, as well as the scarcity of paired motion data. This paper presents Human2Humanoid, an unsupervised motion retargeting framework that transfers human motions to humanoid robot behaviors with high fidelity. To bridge the domain gap under unpaired data, we adopt a CycleGAN-based architecture equipped with a skeleton-aware graph convolutional network to capture topology-dependent motion features. To address cross-domain scale mismatches, we introduce a morphology-invariant end-effector consistency loss that aligns normalized end-effector trajectories to preserve motion semantics across embodiments. To improve physical plausibility and reduce contact artifacts, we impose explicit physics-aware feasibility constraints to encourage reproduction of the contact patterns in the source motion. Experimental results show that the proposed method successfully retargets human motion to the Unitree G1 humanoid robot without paired data, and outperforms existing methods in both downstream controllability and physical feasibility.

2606.03470 2026-06-03 cs.CV

Mixed-Modality Dual Face-Hair Retrieval

Quoc-Anh Bui-Huynh, Mai-Tuyen Lam, Dai-Anh-Tuan Nguyen, Thanh Duc Ngo

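Affiliations: Vietnam National University, Ho Chi Minh City, Vietnam; University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam

AI summary: Proposes DFHR, a mixed-modality dual-reference retrieval task, and achieves identity-aware, attribute-controllable cross-modal retrieval by disentangling identity and hairstyle features and fusing multimodal embeddings.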

Abstract

We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes -- identity and hairstyle -- originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.

2606.03467 2026-06-03 cs.AI

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang

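Affiliations: Peking University

AI summary: Proposes StepFinder, a framework that encodes execution logs into temporal semantic sequences and applies temporal modeling with attention modules to locate the root-cause step of failures in multi-agent systems efficiently and accurately.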

Comments 12 pages, 5 figures. Accepted by KDD 2026

Abstract

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

2606.03465 2026-06-03 cs.LG cs.AI

Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression

Artur Zagitov, Alexander Miasnikov, Maxim Krutikov, Vladimir Aletov, Gleb Molodtsov, Nail Bashirov, Artem Tsedenov, Aleksandr Beznosikov

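Affiliations: University of Florida; National Research University Higher School of Economics

AI summary: Systematically evaluates tensor decompositions for post-training compression of dense and MoE architectures. Empirical and theoretical analysis reveals a fundamental mismatch between these decompositions and the heterogeneous representations of LLMs, which delimits their practical limits and their viable role in deployment at scale.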

2606.03463 2026-06-03 cs.AI cs.CL

DMF: A Deterministic Memory Framework for Conversational AI Agents

Matteo Stabile, Enrico Zimuel

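Affiliations: Roma Tre University

AI summary: Proposes DMF, a CPU-first deterministic memory framework that replaces generative memory compression with classical NLP analysis, vector geometry, and mathematical scoring, reaching accuracy comparable to Mem0 while using zero tokens to prepare the memory context.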

Comments 21 pages, 3 figures

Abstract

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.
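The scoring scheme described above can be sketched roughly as follows. The abstract only says that signals are combined through a logistic projection and that relevance decays with the number of newer interactions; the linear weighting, the exponential form of the decay, the pruning threshold, and every name below are assumptions rather than DMF's actual formulation.

```python
import math


def survival_score(signals, weights, bias=0.0):
    """Logistic projection of weighted signals into (0, 1). Hypothetical form."""
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega, delta_n, decay_rate=0.1):
    """Decay by interaction count, not wall-clock time.

    The exponential shape and `decay_rate` are assumed, not taken from the paper.
    """
    return omega * math.exp(-decay_rate * delta_n)


def should_prune(omega, delta_n, threshold=0.2, decay_rate=0.1):
    """Drop a memory once its decayed score falls below an assumed threshold."""
    return effective_score(omega, delta_n, decay_rate) < threshold


# Signals: (content, conversational cue, provenance), each already in [0, 1].
omega = survival_score(signals=(0.9, 0.4, 1.0), weights=(2.0, 1.0, 0.5), bias=-1.0)
fresh = should_prune(omega, delta_n=0)
stale = should_prune(omega, delta_n=50)
```

Because every input is a plain number and `delta_n` is a turn count, repeated runs over the same conversation give identical pruning decisions, which is the determinism property the abstract emphasizes.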

2606.03462 2026-06-03 cs.LG cs.SI

Topology-Aware Gaussian Graph Repair for Robust Graph Neural Networks

Anubha Goel, Juho Kanniainen

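Affiliations: Computing Science/Financial Computing and Data Analytics Group, Tampere University

AI summary: Proposes Topology-Aware Gaussian Repair (TAGR), a framework that builds a sparse feature-neighborhood graph with an adaptive Gaussian kernel and combines it with a topology-aware residual correction, improving the robustness of graph neural networks under noisy-edge and missing-edge settings without changing the network architecture.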

Abstract

Graph neural networks have achieved strong performance on graph-structured data, but their effectiveness depends heavily on the quality of the observed graph. In real applications, graph topology is often imperfect: noisy edges may connect unrelated nodes, while missing edges may prevent useful information from being propagated. Existing robust graph learning methods mainly address this problem by removing suspicious edges or by learning a new graph structure during training. However, edge removal alone cannot recover missing connections, and graph structure learning may introduce additional optimization complexity. In this paper, we propose Topology-Aware Gaussian Repair (TAGR), a simple graph repair framework for robust message passing in graph neural networks. Instead of learning a dense adjacency matrix, TAGR constructs a sparse feature-neighborhood graph using an adaptive Gaussian kernel and combines it with a topology-aware residual correction of the observed graph. The Gaussian repair component introduces auxiliary edges between feature-similar nodes, while the residual correction preserves and reweights the original topology according to local feature and structural consistency. The repaired graph can be used directly with standard graph neural networks without changing their architectures. Extensive experiments on benchmark citation networks show that TAGR improves the robustness of GNNs under both noisy-edge and missing-edge settings. The analysis further shows that Gaussian feature-neighborhood repair provides the main robustness gain, while topology-aware residual correction improves stability when the observed graph is incomplete. These results suggest that effective graph robustness can be achieved through lightweight sparse graph repair rather than dense graph structure learning.
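The Gaussian repair component can be pictured as a sparse nearest-neighbor graph weighted by a Gaussian kernel. The sketch below uses a single fixed `bandwidth` for simplicity, whereas the paper describes an adaptive kernel whose details are not in the abstract; the residual correction is omitted, and the function name and parameters are hypothetical.

```python
import math


def gaussian_knn_graph(features, k, bandwidth):
    """Connect each node to its k most feature-similar nodes.

    Returns {(i, j): weight} with i < j, where the weight is the Gaussian
    kernel exp(-||x_i - x_j||^2 / (2 * bandwidth^2)). Fixed bandwidth is an
    assumption; TAGR's kernel is described as adaptive.
    """
    n = len(features)
    edges = {}
    for i in range(n):
        scored = []
        for j in range(n):
            if i == j:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            scored.append((math.exp(-sq_dist / (2 * bandwidth ** 2)), j))
        # Highest similarity first; ties broken by node index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for weight, j in scored[:k]:
            edges[(min(i, j), max(i, j))] = weight
    return edges


# Two tight clusters: auxiliary edges should stay inside each cluster.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
aux_edges = gaussian_knn_graph(features, k=1, bandwidth=1.0)
```

Keeping only `k` neighbors per node is what keeps the repaired graph sparse, in contrast to learning a dense adjacency matrix.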

2606.03461 2026-06-03 cs.AI

What Makes Interaction Trajectories Effective for Training Terminal Agents?

Sidi Yang, Chaofan Tao, Jierun Chen, Tiezheng Yu, Ruoyu Wang, Yuxin Jiang, Yiming Du, Wendong Xu, Jing Xiong, Taiqiang Wu, Lifeng Shang, Xiaohui Li, Ngai Wong, Haoli Bai

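Affiliations: The University of Hong Kong; Huawei Technologies; Nanyang Technological University

AI summary: Studies the teaching efficacy of interaction trajectories using the Terminal-Lego pipeline. Trajectories from a lower-scoring agent (DeepSeek-V3.2) improve student generalization more than those from a higher-scoring agent (Claude Opus 4.6), which the authors attribute to Environment-Grounded Supervision (EGS); the study also demonstrates strong data efficiency.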

Abstract

Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.

2606.03459 2026-06-03 cs.SD cs.AI

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

François Pachet

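Affiliations: LIP6, Sorbonne Université, Paris, France; Ynosound, Paris, France

AI summary: Proposes tonal parsimony, which lexicographically minimizes the number of modulations and then the number of distinct tonalities. Combining dynamic programming with the fixed 24-tonality space, it reduces tonal vocabulary in chord-sequence analysis while keeping the modulation count optimal.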

Comments 20 pages, 1 figure

Abstract

We study the assignment of local tonalities to chord sequences, a task useful for harmonic analysis, composition, and jazz-oriented improvisation. Standard dynamic-programming approaches minimize modulations but can introduce unnecessarily many tonal centers. We compare this transition-only objective with pure minimum-vocabulary analysis and with tonal parsimony, which minimizes lexicographically the number of modulations and then the number of distinct tonalities. Although this joint objective is combinatorially hard in general, we give exact algorithms exploiting the fixed 24-tonality major/minor universe. On 31,032 LMD Chords sequences, tonal parsimony preserves the transition optimum while reducing tonal vocabulary in 55.8% of cases. With weighted jazz-substitution closure, it lowers mean tonalities from 3.802 to 3.206 and modulations from 16.728 to 12.141. On 1,555 annotated jazz standards, it improves compatible chord-scale agreement to 95.6%, supporting tractable professional-scale harmonic analysis.
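For orientation, the transition-only baseline mentioned in the abstract (minimizing modulations alone) can be sketched as a small dynamic program. This is not the paper's tonal-parsimony algorithm, which additionally minimizes the number of distinct tonalities; the input format (one set of compatible tonalities per chord) and the function name are assumptions.

```python
def min_modulation_labeling(compatible):
    """Assign one tonality per chord with the fewest tonality changes.

    `compatible[i]` is the non-empty set of tonalities compatible with chord i
    (a hypothetical input format). Returns (modulation_count, labeling).
    """
    # cost[t]: fewest modulations for a labeling of the prefix that ends in t.
    cost = {t: 0 for t in compatible[0]}
    back = []
    for options in compatible[1:]:
        best_prev = min(cost, key=lambda t: (cost[t], t))
        new_cost, pointers = {}, {}
        for t in sorted(options):
            if t in cost and cost[t] <= cost[best_prev] + 1:
                # Staying in the same tonality costs nothing extra.
                new_cost[t], pointers[t] = cost[t], t
            else:
                # Otherwise modulate from the cheapest previous tonality.
                new_cost[t], pointers[t] = cost[best_prev] + 1, best_prev
        back.append(pointers)
        cost = new_cost
    last = min(cost, key=lambda t: (cost[t], t))
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[last], path


# Four chords; the third one is only compatible with G.
chords = [{"C"}, {"C", "G"}, {"G"}, {"C", "G"}]
count, labeling = min_modulation_labeling(chords)
```

Several labelings can share the same minimal modulation count, which is exactly the slack that a secondary objective such as vocabulary size can exploit.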

2606.03458 2026-06-03 cs.LG

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli

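Affiliations: Huawei

AI summary: Proposes KVarN, a calibration-free KV-cache quantization method that applies a Hadamard rotation and dual-scaling variance normalization to reduce the accumulation of quantization error during autoregressive decoding, reaching state-of-the-art results on generative benchmarks at 2-bit precision.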

Abstract

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state-of-the-art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
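The two ingredients named above, a Hadamard rotation and a scale normalization before low-bit rounding, can be illustrated on a single vector. This is a simplified sketch and not the KVarN implementation: it normalizes along one axis only (the paper describes dual scaling across both axes of K and V), uses the root mean square as the scale, and assumes a uniform four-level grid; all names are hypothetical.

```python
import math

LEVELS = (-1.5, -0.5, 0.5, 1.5)  # assumed uniform 2-bit grid, in units of the scale


def hadamard_rotate(vector):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    values = list(vector)
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = values[i], values[i + h]
                values[i], values[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in values]


def quantize_2bit(vector):
    """Normalise by the root mean square, then snap to the nearest of four levels."""
    rms = math.sqrt(sum(v * v for v in vector) / len(vector)) or 1.0
    codes = [min(range(4), key=lambda c: abs(v / rms - LEVELS[c])) for v in vector]
    return codes, rms


def dequantize_2bit(codes, rms):
    """Map 2-bit codes back to approximate values."""
    return [LEVELS[c] * rms for c in codes]


# A vector with one large outlier, which the rotation spreads across all entries.
token = [1.0, 0.0, 0.0, 0.0]
rotated = hadamard_rotate(token)
codes, rms = quantize_2bit(rotated)
restored = hadamard_rotate(dequantize_2bit(codes, rms))  # the transform is its own inverse
```

The rotation turns a single outlier into four equal entries, so one scale fits every coordinate; this is the usual motivation for rotating before quantizing.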

2606.03444 2026-06-03 cs.CV cs.AI

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang

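Affiliations: University of Science and Technology of China

AI summary: Proposes PRISM, a framework with a dual-stream mixture-of-experts (MoE) architecture and a two-stage paradigm (first deconstructing expert knowledge so that experts specialize, then dynamically recombining them into task-specific pathways) that addresses negative transfer when integrating vision foundation models, setting a new state of the art on PASCAL-Context and NYUD-v2.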

Comments Accepted to ICML 2026

2606.03437 2026-06-03 cs.CL

Large Language Models Are Overconfident in Their Own Responses

Mario Sanz-Guerrero, Manuel Mager, Katharina von der Wense

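Affiliations: Johannes Gutenberg University Mainz; University of Colorado Boulder

AI summary: Studies the miscalibration of large language models caused by instruction tuning and chat templates, finds that an "ownership bias" makes models up to 26% more confident in their own answers, and proposes reducing overconfidence by presenting the model's answer as user input.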

Comments Accepted to ACL 2026 Findings

Abstract

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.
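Calibration is commonly quantified with the expected calibration error (ECE), which compares stated confidence with observed accuracy inside confidence bins. The abstract does not say which calibration metric the paper reports, so treating ECE as the metric here is an assumption, as are the function name and the equal-width binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    `confidences` are values in [0, 1]; `correct` are booleans of equal length.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins[index].append((confidence, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(mean_confidence - accuracy)
    return ece


# A model that says 90% on four answers but gets only three of them right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

Under this metric, an inference-time change that lowers stated confidence toward actual accuracy shows up as a smaller ECE without any retraining.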

2606.03435 2026-06-03 cs.AI

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

CP-Agent: 化学扰动下细胞形态学轮廓的上下文感知多模态推理

Yuxin Zhang, Yiyao Li, Ping Shu Ho, Simon See, Zhenqin Wu, Kevin Tsia

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong（香港大学电子与计算机工程系）； School of Computing and Data Science, The University of Hong Kong（香港大学计算与数据科学学院）； School of Biomedical Engineering, The University of Hong Kong（香港大学生物医学工程学院）； Nvidia AI Technology Center（NVIDIA人工智能技术中心）； Advanced Biomedical Instrumentation Centre（先进生物医学仪器中心）

AI总结提出CP-Agent，一种基于上下文感知对齐模块CP-CLIP的多模态大语言模型，用于生成药物扰动下细胞形态变化的可解释机制性解释，实现高精度处理与机制区分（最大F1分数0.896），并整合工具使用与推理生成结构化报告以加速药物发现。

Comments ICLR 2026

AI中文摘要

Cell Painting结合多重荧光染色、高内涵成像和定量分析，生成高维表型读数，以支持多种下游任务，如作用机制（MoA）推断、毒性预测和药物-疾病图谱构建。然而，现有工作流程缓慢、昂贵且难以解释。药物筛选建模方法主要侧重于分子表示学习，而忽略了实际实验上下文（例如细胞系、给药方案等），限制了泛化性和MoA分辨率。我们引入了CP-Agent，一种智能多模态大语言模型（MLLM），能够为药物扰动下的细胞形态变化生成与机制相关、人类可解释的理由。其核心是CP-Agent利用上下文感知对齐模块CP-CLIP，该模块联合嵌入高内涵图像和实验元数据，以实现稳健的处理和MoA区分（达到最大F1分数0.896）。通过将CP-CLIP输出与智能工具使用和推理相结合，CP-Agent将理由编译成结构化报告，以指导实验设计和假设优化。这些能力凸显了CP-Agent通过实现更可解释、可扩展和上下文感知的表型筛选来加速药物发现的潜力——简化药物发现中假设生成的迭代循环。

英文摘要

Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.

2606.03420 2026-06-03 cs.CV

PHAF-Personalized Hand Avatars in a Flash

PHAF-瞬间个性化手部化身

Meghana Shankar, Akanxit Upadhyay, Anmol Namdev, Green Rosh KS, Pawan Prasad BH

发表机构 * Samsung R&D Institute（三星研发机构）

AI总结提出PHAF方法，从两张图像（手背和手掌）快速生成个性化逼真手部化身，通过语义引导网格对齐和密集纹理提取，结合视图修复网络，实现高质量多视角渲染，纹理生成速度比现有方法快30倍。

2606.03417 2026-06-03 cs.CV

A unified multi-task framework enables interpretable chest radiograph analysis

统一多任务框架实现可解释的胸部X光片分析

Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出IMT-CXR框架，通过统一Transformer架构模拟放射科医生诊断流程，实现疾病识别、属性表征和可追溯报告生成，在十个基准上表现优异，且临床评估中66%的AI报告达到或超越原始报告。

AI中文摘要

虽然多模态深度学习推动了医学影像分析，但现有的黑箱系统可能局限于孤立任务，常常忽视临床诊断作为多任务过程对信任敏感的本质。我们提出IMT-CXR（可解释多任务Transformer用于胸部X光分析），该框架通过三个基于证据的阶段模拟放射科医生的诊断工作流：1）疾病识别；2）属性表征（如大小、位置、严重程度量化）；3）具有可追溯决策路径的证据整合报告生成。该框架采用统一Transformer架构，通过医学领域指令调优优化，顺序执行四个临床任务：多标签疾病分类、病灶定位、解剖分割和放射学报告生成。实验验证表明，在直接推理和微调设置下，该框架在十个CXR基准上表现出竞争性性能。在一项对来自四个医疗中心的160份历史报告的盲评中，三位放射科医生认为66%的AI生成报告在诊断清晰度上达到或超越原始临床报告，凸显了该框架的转化潜力。通过建立从解剖发现到结论的可追溯诊断路径，这项工作弥合了AI技术指标与临床实用性之间的差距，推动了医学影像中可信赖AI系统的发展。

英文摘要

While multimodal deep learning has advanced medical imaging analysis, existing black-box systems may remain confined to isolated tasks, often overlooking the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transformer for Chest X-ray Analysis), a framework that emulates radiologists' diagnostic workflow through three evidence-driven stages: 1) Disease recognition; 2) Attribute characterization (e.g., size, location, severity quantification); 3) Evidence-integrated report generation with traceable decision pathways. The framework employs a unified transformer architecture optimized via medical-domain instruction tuning, sequentially executing four clinical tasks: multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation. Experimental validation demonstrates competitive performance on ten CXR benchmarks under direct inference and fine-tuning settings. In a blinded evaluation of 160 historical reports from four medical centers, three radiologists rated 66% of AI-generated reports as comparable to or surpassing original clinical reports in diagnostic clarity, highlighting the framework's translational potential. By establishing traceable diagnostic pathways from anatomical findings to conclusions, this work bridges the gap between AI technical metrics and clinical utility, advancing trustworthy AI systems in medical imaging.

2606.03410 2026-06-03 cs.CV

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

Enginuity：工程图纸视觉语言理解的数据集与基准

Abhishek Kumar, Isha Motiyani, Tilak Kasturi, Ethan Seefried, Prahitha Movva, Tirthankar Ghosal

发表机构 * Predii ； Oak Ridge National Laboratory（橡树岭国家实验室）； Independent Researcher（独立研究员）

AI总结针对工程图纸领域缺乏公开基准的问题，提出首个开放数据集Enginuity，通过结构化零件表提取和自由形式视觉问答两项任务评估前沿VLM，揭示零件识别与描述保真度之间的系统性差距。

AI中文摘要

工程图纸对视觉语言模型提出了独特的挑战：与自然图像或通用文档不同，它们通过密集的空间布局、领域特定符号以及视觉标注与结构化零件表之间的交叉引用来编码信息。尽管工程图纸在服务、维修和设计工作流中至关重要，但目前尚无公开基准来衡量该领域VLM的能力；现有数据集主要关注流程图、科学图表或商业文档。为填补这一空白，我们引入了Enginuity，这是首个用于评估复杂工程图纸上VLM的开放数据集和基准。我们在美国军用服务和维修手册语料库上定义了两项任务：结构化零件表提取（任务1）和自由形式视觉图问答（VQA）（任务2）用于基准测试。我们在零样本和思维链提示下评估了四种前沿VLM（GPT-5.2 Chat、Claude Opus 4.7、Gemma 4、Qwen3-VL-32B-Instruct）。在任务1上，模型达到了0.61-0.87的Recall@all，但Token F1pen仅为0.03-0.18，暴露了零件识别与描述保真度之间的系统性差距。任务2揭示了所有模型在事实推理上的一致差距。一项支持性分析表明，相对于语义相似性，token重叠指标将技术描述上的模型能力低估了2-6倍，这促使在领域特定评估中进行LLM作为评判者的校准。我们发布了数据集、注释、评估框架以及每个样本的模型输出，以支持对工程内容上VLM能力的可重复研究。

英文摘要

Engineering diagrams pose a distinct challenge for vision-language models: unlike natural images or general documents, they encode information through dense spatial layouts, domain-specific symbols, and cross-references between visual callouts and structured parts tables. Despite their centrality to service, repair, and design workflows, there is no public benchmark for measuring VLM capabilities in this domain; existing datasets primarily focus on flowcharts, scientific figures, or business documents. To address this gap, we introduce Enginuity, the first open dataset and benchmark for evaluating VLMs on complex engineering diagrams. We define two tasks over a corpus of U.S. military service and repair manuals: structured parts-table extraction (Task 1) and free-form visual diagram question answering (VQA)(Task 2) for benchmarking. We evaluate four frontier VLMs (GPT-5.2 Chat, Claude Opus 4.7, Gemma 4, Qwen3-VL-32B-Instruct) under zero-shot and chain-of-thought prompting. On Task 1, models reach Recall@all of 0.61-0.87 but Token F1pen of only 0.03-0.18, exposing a systematic gap between part identification and description fidelity. Task 2 reveals a consistent factual-reasoning gap across all models. A supporting analysis shows that token-overlap metrics under-report model capability on technical descriptions by 2-6x relative to semantic similarity, motivating LLM-as-judge calibration for domain-specific evaluation. We release the dataset, annotations, evaluation harness, and per-sample model outputs to support a reproducible study of VLM capability on engineering content.
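The abstract contrasts token-overlap scores with semantic similarity but does not define "Token F1pen" (presumably a penalized variant). As a rough sketch of why overlap metrics can under-report quality on technical descriptions, here is plain bag-of-tokens F1 under the assumption of whitespace tokenization and lowercasing. The function name and the part descriptions are invented for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted and a reference description."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "hex head bolt"
print(token_f1("hex head bolt", reference))    # exact match scores 1.0
print(token_f1("hex bolt", reference))         # partial overlap scores about 0.8
print(token_f1("hexagonal screw", reference))  # a paraphrase with no shared tokens scores 0.0
```

The last call shows the failure mode: a description a technician might accept as equivalent receives zero credit because no surface token matches.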

2606.03401 2026-06-03 cs.CV

Towards Characterizing Scientific Image Utility and Upgradability

面向科学图像效用与可升级性的表征

WenZhe Li, Qihang Yan, Liang Chen, Junying Wang, Farong Wen, Yijin Guo, Chunyi Li, Zicheng Zhang, Guangtao Zhai

发表机构 * TongJi University（同济大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Shanghai Jiao Tong University（上海交通大学）

AI总结针对AI生成内容对科学图像完整性的威胁，提出SIU²A框架，通过效用（错误检测与修正可行性）和可升级性（修正质量）两个维度评估科学图像，并构建基准数据集揭示当前多模态系统在科学错误评估与忠实修正方面的显著局限。

AI中文摘要

科学图像在研究交流中作为关键证据，但其完整性面临来自AI生成内容的前所未有的威胁，这些内容引入了微妙但严重的错误。现有的评估范式被证明是不充分的：感知质量指标与科学有效性相关性差，而语言模型缺乏特定领域的验证能力。为了解决这一差距，我们提出了科学图像效用与可升级性评估（SIU²A）框架，该框架引入了两个互补的科学图像评估维度。效用包括错误检测（识别科学不准确性）和修正可行性（评估错误是否可以被可靠修复）。可升级性衡量修正的质量。我们将科学图像损坏分为四种基本类型：细节失真、不完整性、虚假内容和实体混淆。基于这一分类，我们构建了SIU²A-Benchmark，这是一个包含专家标注用于错误识别和修复的数据集。该框架实现了一个两阶段评估协议：效用阶段评估错误检测能力和修复指令生成，而可升级性阶段评估修正是否在不损害现有准确信息的情况下忠实恢复科学有效性。实验表明，当前的多模态系统在科学错误评估和忠实修正方面表现出显著局限性，揭示了视觉感知与科学可用性之间的根本差距。

英文摘要

Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the Scientific Image Utility and Upgradability Assessment (SIU²A) framework, which introduces two complementary dimensions for scientific image evaluation. Utility encompasses error detection (identifying scientific inaccuracies) and correction feasibility (assessing whether errors can be reliably repaired). Upgradability measures the quality of correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU²A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the Utility stage evaluates error detection capability and repair instruction generation, while the Upgradability stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.

2606.03399 2026-06-03 cs.CL cs.CR

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

选择性令牌级密码学编辑用于大型语言模型的隐私保护临床部署

Farhan Sheth, Ziyuan Yang, Yongying Lan, Si Yong Yeo

发表机构 * MedVisAI Lab, Singapore（新加坡MedVisAI实验室）； Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, China（中国上海交通大学医学院瑞金医院）； Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore（新加坡南洋理工大学李光前医学院）

AI总结提出HERALD框架，通过令牌级密码学编辑仅加密敏感令牌，在保护隐私的同时保持下游模型效用，在分类和医疗问答任务上接近明文性能。

Comments 33 pages, 8 figures, 26 tables

AI中文摘要

尽管大型语言模型（LLMs）越来越多地用于临床应用，但许多现有流程需要将原始敏感健康信息发送到远程服务器进行处理，这增加了隐私泄露的风险。缓解这种风险的一种自然方法是在传输前对数据进行加密。然而，加密整个数据集等直接解决方案会带来巨大的计算、对齐和通信开销，使得大规模实际部署不可行。为了在保护隐私的同时保持可用性，我们提出了通过自适应语言分解的医疗加密与编辑（HERALD），这是一个令牌级密码学编辑框架，通过仅加密敏感令牌同时保留上下文以供下游模型使用，实现了这种平衡。HERALD结合医学命名实体识别器（NER）与基于词性（POS）的策略来选择候选令牌，执行目标词形还原以稳定表面形式，并将每个受保护令牌替换为包裹在显式分隔符中的确定性密文。值得注意的是，HERALD是模型无关的，完全在客户端运行，确保敏感内容在存储、传输和处理过程中保持加密，无需更改下游模型。我们在公开数据集上对分类和医疗问答（MQA）任务评估了HERALD。在不同任务中，实验表明完全安全的基线遭受显著的效用损失，而HERALD始终恢复接近明文的性能。总体而言，HERALD提供了一种新颖的利用流程。

英文摘要

While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy leakage. A natural approach to mitigate this risk is to encrypt the data before transmission. However, straightforward solutions such as encrypting the entire dataset introduce prohibitive computational, alignment, and communication overheads, rendering large-scale practical deployment infeasible. To preserve privacy while maintaining usability, we present Healthcare Encryption & Redaction via Adaptive Linguistic Decomposition (HERALD), a token-level cryptographic redaction framework designed to achieve this balance by encrypting only sensitive tokens while preserving the surrounding context for downstream model utility. HERALD combines medical named-entity recognizer (NER) with part-of-speech (POS) driven policies to select candidate tokens, performs targeted lemmatization to stabilize surface forms, and substitutes each protected token with a deterministic ciphertext wrapped in explicit delimiters. Notably, HERALD is model-agnostic and operates entirely on the client side, ensuring that sensitive content remains encrypted throughout storage, transmission, and processing without requiring changes to downstream models. We evaluated HERALD on both classification and medical question answering (MQA) tasks on public datasets. Across different tasks, experiments illustrate that fully secured baselines suffer significant utility loss, whereas HERALD consistently recovers performance close to plaintext. Overall, HERALD provides a novel utilization pipeline.

2606.03398 2026-06-03 cs.CL cs.AI

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Transformer建模计数器语言中栈表示的因果证据

Nishit Singh

发表机构 * Birla Institute of Technology and Science, Pilani（比尔拉理工学院皮拉尼校区）

AI总结通过线性探针和消融实验，证明Transformer在计数器语言任务中学习的栈表示对其性能具有因果必要性。

Comments 8 pages, 8 figures

2606.03392 2026-06-03 cs.RO

OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform

OpenEAI-Platform: 一个开源具身人工智能硬件-软件统一平台

Jinyuan Zhang, Luoyi Fan, Leiyu Wang, Yeqiang Wang, Yicheng Zhu, Cewu Lu, Nanyang Ye

发表机构 * Shanghai Innovation Institute（上海创新研究院）； Huazhong University of Science and Technology（华中科技大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出OpenEAI-Platform，集成低成本6+1自由度机械臂和可复现VLA模型，通过开源设计和两阶段训练在真实操作任务中超越商业臂，性能媲美大规模预训练基线。

AI中文摘要

现实世界中的具身AI需要精确的硬件和稳健的视觉-语言-动作（VLA）策略。我们提出OpenEAI-Platform，一个完全开源平台，集成了低成本6+1自由度机械臂（OpenEAI-Arm）和可复现的VLA模型（OpenEAI-VLA）。OpenEAI-Arm提供开源机械设计以实现低制造成本，并采用柔顺控制方法以提高精度。OpenEAI-VLA基于Qwen3-VL-4B，使用扩散Transformer动作头，并仅使用开源机器人和多模态数据集进行两阶段训练。在四个真实操作任务中，OpenEAI-Arm在相同策略下优于两款商用6+1自由度机械臂，而OpenEAI-VLA在仅有限预训练数据下达到了与大规模预训练pi0基线相当的成功率。我们将发布完整的硬件设计、驱动程序、模型以及训练/数据流水线，以支持可复现研究和可扩展数据收集。我们的代码、布局和模型将在论文被接收后发布。

英文摘要

Embodied AI in the real world requires both accurate hardware and robust vision-language-action (VLA) policies. We present OpenEAI-Platform, a fully open-source platform that integrates a low-cost 6+1 degree-of-freedom (dof) robotic arm (OpenEAI-Arm) and a reproducible VLA model (OpenEAI-VLA). OpenEAI-Arm provides open-source mechanical designs for low manufacturing cost and compliant control methods for higher accuracy. OpenEAI-VLA builds on Qwen3-VL-4B and uses a Diffusion Transformer action head, and is trained in two stages with only open-source robot and multimodal datasets. Across four real-world manipulation tasks, OpenEAI-Arm outperforms two commercial 6+1-dof arms under the same policy, and OpenEAI-VLA achieves success rates comparable to the large-scale pretrained pi0 baseline with only limited pretraining data. We will release the full hardware designs, drivers, models, and training/data pipelines to support reproducible research and scalable data collection. Our codes, layouts, and models will be released after the paper is accepted.

2606.03391 2026-06-03 cs.LG cs.AI cs.CL

When Model Merging Breaks Routing: Training-Free Calibration for MoE

当模型合并破坏路由：MoE的无训练校准

Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对MoE架构中模型合并导致的路由崩溃问题，提出基于二阶曲率的无训练校准方法HARC，通过闭式解和共轭梯度法高效重对齐路由器，显著提升数学推理和代码生成性能。

AI中文摘要

模型合并已成为一种无需重新训练即可整合多个LLM能力的成本效益方法。然而，现有的合并技术主要基于线性参数算术或优化，在应用于混合专家（MoE）架构时面临困难。我们识别出MoE合并中的一个关键失效模式，称为路由崩溃，其中合并后的路由器无法将令牌分派给合适的专家。路由崩溃源于非线性softmax和离散Top-k路由机制对合并引起的参数扰动的敏感性，这种敏感性进一步被MoE预训练期间施加的负载平衡约束放大。由于微调后的专家表现出不同的专长，即使是适度的错误路由也可能导致严重的性能下降。为解决此问题，我们提出Hessian感知路由器校准（HARC），一种无训练框架，利用二阶曲率信息重新对齐合并后的路由器。该方法采用闭式解，可通过无矩阵共轭梯度法高效求解。在数学推理和代码生成任务上的实验表明，HARC有效缓解了多种MoE合并基线中的路由崩溃，并带来了显著的性能提升。我们的代码可在 https://github.com/huangcb01/HARC 获取。

英文摘要

Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
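The abstract says the calibration has a closed-form solution that is solved with a matrix-free conjugate gradient method, without giving the system itself. The sketch below is therefore a generic textbook conjugate gradient solver, not the HARC implementation. It only assumes a symmetric positive-definite operator exposed as a `matvec` callback (for example a Hessian-vector product), so the matrix never has to be materialized. All names and the 2x2 toy system are illustrative.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(
    matvec: Callable[[Vector], Vector],
    b: Vector,
    tol: float = 1e-10,
    max_iter: int = 100,
) -> Vector:
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    residual = list(b)           # b - A x with x = 0
    direction = list(residual)
    rs_old = dot(residual, residual)

    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:  # converged: residual norm below tolerance
            break
        a_dir = matvec(direction)
        step = rs_old / dot(direction, a_dir)
        x = [xi + step * di for xi, di in zip(x, direction)]
        residual = [ri - step * adi for ri, adi in zip(residual, a_dir)]
        rs_new = dot(residual, residual)
        # New search direction is conjugate to the previous ones.
        direction = [ri + (rs_new / rs_old) * di for ri, di in zip(residual, direction)]
        rs_old = rs_new
    return x


def toy_hessian_vector_product(v: Vector) -> Vector:
    # Stands in for a curvature operator; here simply A = [[4, 1], [1, 3]].
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


solution = conjugate_gradient(toy_hessian_vector_product, [1.0, 2.0])
print(solution)  # close to [1/11, 7/11]
```

The appeal of the matrix-free form is memory: for a router with many parameters, storing the full curvature matrix is far more expensive than evaluating its product with one vector at a time.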

2606.03390 2026-06-03 cs.RO

Extreme Motion Generation via Hybrid Null-Space Control for Straight-Line Path Following

通过混合零空间控制实现直线路径跟踪的极端运动生成

Xinyi Yuan, Weiwei Wan, Kensuke Harada

发表机构 * Graduate School of Engineering Science, The University of Osaka, Japan（日本大阪大学大学院基础工学研究科）

AI总结提出一种混合控制器，结合强化学习策略和模型控制，在关节极限附近切换，以最大化机械臂沿预定轨迹的笛卡尔路径长度，在7自由度Franka FR3上平均延长27%的路径长度。

AI中文摘要

这项工作研究了“极端运动生成”，旨在在机械臂工作空间内沿预定义轨迹最大化笛卡尔路径长度。这一目标在工业中很重要，因为路径跟踪是许多任务（如表面涂层和焊接）的基础。更关键的是，极端运动使固定基座机械臂能够在有限可达性下利用运动学能力。然而，这种利用在实践中具有挑战性，因为机械臂必须在执行过程中主动避开安全边界，这本质上是一个长视界问题。因此，我们主张长视界决策应委托给基于学习的策略以最大化利用，而经典模型控制器覆盖近边界区域，其中学习策略由于稀疏数据覆盖而急剧退化。具体来说，我们提出的方法是一个步级混合控制器，根据归一化关节极限距离在基于强化学习的控制器和模型控制器之间切换。初始关节配置通过条件扩散采样获得，基于学习到的运动先验改进了可实现的路径长度。我们在7自由度Franka FR3上对10,000个直线路径跟踪任务评估了所提出的框架，平均滚动长度比基于模型的基线延长了27%。值得注意的是，某些任务产生了朝向运动极端的显著延伸，如统计结果中报告的最大改进所示。本文的项目网站和相关视频见 https://yuan-xinyi.github.io/extreme-motion-generation/ 。

英文摘要

This work studies "extreme motion generation", which aims to maximize the Cartesian path length along a pre-defined trajectory within the manipulator's workspace. This objective is important in industry because path-following is fundamental to a large variety of tasks such as surface coating and welding. More critically, extreme motion enables a fixed-base manipulator to exploit its kinematic capability under limited reachability. However, such exploitation is challenging in practice, as the manipulator must actively avoid the safety boundary throughout execution, which is inherently a long-horizon problem. Accordingly, we claim that long-horizon decision-making should be delegated to a learning-based policy to maximize exploitation, while a classical model-based controller covers the near-boundary region, where the learning policy degrades sharply due to sparse data coverage. In detail, our proposed method is a step-level hybrid controller that switches between an RL-based and a model-based controller according to the normalized joint-limit distance. The initial joint configuration is sampled through conditional diffusion-based sampling, which improves the achievable path length based on the learned motion prior. We evaluate the proposed framework on 10,000 straight-line path-following tasks with a 7-DoF Franka FR3, extending the average rollout length by 27% over the model-based baseline. Notably, certain tasks yield a pronounced extension toward the motion extreme, as reflected in the maximum improvement reported in the statistical results. The project website and related videos of this paper can be found at https://yuan-xinyi.github.io/extreme-motion-generation/.
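The step-level switching rule described above can be sketched roughly as follows. The abstract does not give the normalization or the switching threshold, so both are assumptions here: the distance is taken as the smallest margin of any joint to its nearest limit, scaled by that joint's half-range, and the threshold of 0.1 is a made-up value. The function names and joint limits are hypothetical.

```python
from typing import Sequence


def normalized_joint_limit_distance(
    q: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Smallest per-joint margin to a limit, scaled so 1.0 is mid-range and 0.0 is at a limit."""
    return min(
        min(qi - lo, hi - qi) / ((hi - lo) / 2.0)
        for qi, lo, hi in zip(q, lower, upper)
    )


def select_controller(
    q: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    threshold: float = 0.1,  # hypothetical switching threshold
) -> str:
    """Use the learned policy far from limits and the model-based controller near them."""
    distance = normalized_joint_limit_distance(q, lower, upper)
    return "model_based" if distance < threshold else "rl_policy"


# Two-joint toy arm with invented limits (radians).
lower_limits = [-1.0, -2.0]
upper_limits = [1.0, 2.0]

print(select_controller([0.0, 0.0], lower_limits, upper_limits))   # rl_policy
print(select_controller([0.95, 0.0], lower_limits, upper_limits))  # model_based
```

Because the rule is evaluated at every control step, a single joint drifting toward its limit is enough to hand control back to the model-based controller.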

2606.03385 2026-06-03 cs.RO cs.AI

Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation

先抓取后规划与失败归因：一种用于精确且可泛化机器人操作的闭环两阶段框架

Jiahao Xu, Peiyuan Wang, Hanzhuo Zhang, Zihao Yu, Tianyu Fu, Hao Chen, Xuanhao Xiang, Jianbo Yu, Chenchen Fu, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University, China（东南大学计算机科学与工程学院）

AI总结提出GTP-FA框架，通过任务导向的两阶段抓取-规划流程和失败归因模型，在抓取和规划模块中分别注入任务先验和风险惩罚以及针对高风险初始状态进行数据收集和微调，显著提升机器人操作任务的成功率。

Comments 32 pages, project page: https://sites.google.com/view/gtp-fa/

AI中文摘要

在机器人操作中，抓取与运动规划之间的紧密耦合常常掩盖失败的真实原因，导致低效的试错过程。为了实现高效的长时域操作，我们提出了GTP-FA（先抓取后规划与失败归因），一种面向任务的两阶段抓取-规划框架，该框架生成抓取候选并根据所选抓取执行下游运动规划。给定失败的操作轨迹，我们学习一个失败归因模型，该模型可泛化到未见过的抓取，并生成失败模式的稳定分布以进行诊断引导的优化。基于这些归因结果，我们以诊断驱动的方式优化两个模块：在抓取侧，我们将任务级先验和风险惩罚注入抓取候选评分和优化中，以抑制不稳定或与任务不兼容的抓取；在规划侧，我们通过数据收集和微调针对高风险初始状态，以解决真正的规划瓶颈。我们在仿真和真实机器人实验中评估了所提出的框架，并表明GTP-FA在基于RL、IL、扩散策略和VLA的设置中提升了相应的基础学习器，实现了显著更高的总体任务成功率。

英文摘要

In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. Given a failed manipulation trajectory, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes for diagnosis-guided optimization. Based on these attribution results, we then optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. We evaluate the proposed framework in both simulation and real-robot experiments, and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.

2606.03365 2026-06-03 cs.LG

Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings

链接预测还是预测失灵：知识图谱嵌入中不稳定的种子

Guillaume Méroué, Fabien Gandon, Pierre Monnin

发表机构 * Université Côte d’Azur, Inria, CNRS, I3S, France（法国蔚蓝海岸大学、法国国家信息与自动化研究所、法国国家科学研究中心、I3S研究所）

AI总结本文系统分析了多种知识图谱嵌入模型在链接预测中的稳定性，发现高性能模型在三元组预测和嵌入空间上存在显著不稳定性，且随机种子、超参数等因素独立引发同等程度的不稳定，投票机制仅能有限提升稳定性。

Comments Paper accepted at ESWC 2026 (https://2026.eswc-conferences.org)

DOI: 10.1007/978-3-032-25156-5_11

AI中文摘要

嵌入模型（KGEMs）是完成知识图谱的主要链接预测方法。标准评估协议强调基于排名的指标如MRR或Hits@K，但通常忽略随机种子对结果稳定性的影响。此外，这些指标掩盖了个别预测和嵌入空间组织中的潜在不稳定性。在这项工作中，我们对多个数据集上的多种KGEM进行了系统的稳定性分析。我们发现高性能模型实际上在三元组级别产生分歧预测，并具有高度可变的嵌入空间。通过隔离随机因素（即初始化、三元组排序、负采样、dropout、硬件），我们表明每个因素独立地引发相当程度的不稳定性。此外，对于给定模型，具有更好MRR的超参数配置并不能保证更稳定。而且，投票虽然是一种已知的补救机制，但只能提供有限的稳定性增强。这些发现凸显了当前基准测试协议的关键局限性，并引发了对KGEM用于知识图谱补全的可靠性的担忧。

英文摘要

Embedding models (KGEMs) constitute the main link prediction approach to complete knowledge graphs. Standard evaluation protocols emphasize rank-based metrics such as MRR or Hits@K, but usually overlook the influence of random seeds on result stability. Moreover, these metrics conceal potential instabilities in individual predictions and in the organization of embedding spaces. In this work, we conduct a systematic stability analysis of multiple KGEMs across several datasets. We find that high-performance models actually produce divergent predictions at the triple level and highly variable embedding spaces. By isolating stochastic factors (i.e., initialization, triple ordering, negative sampling, dropout, hardware), we show that each independently induces instability of comparable magnitude. Furthermore, for a given model, hyperparameter configurations with better MRR are not guaranteed to be more stable. Moreover, voting, albeit a known remediation mechanism, only provides a limited enhancement of stability. These findings highlight critical limitations of current benchmarking protocols, and raise concerns about the reliability of KGEMs for knowledge graph completion.
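To make the point about aggregate metrics hiding triple-level disagreement concrete, here is a small sketch. MRR and Hits@K follow their standard definitions. The `top1_agreement` measure is an illustrative stand-in and not necessarily the stability measure used in the paper. The ranks and entity names are invented.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank of the correct entity over all test queries (ranks start at 1)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test queries whose correct entity is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def top1_agreement(run_a: Sequence[str], run_b: Sequence[str]) -> float:
    """Fraction of queries on which two training runs return the same top-1 entity."""
    return sum(1 for a, b in zip(run_a, run_b) if a == b) / len(run_a)


# Two hypothetical runs of the same model that differ only in random seed.
ranks_seed_0 = [1, 2]
ranks_seed_1 = [2, 1]
top1_seed_0 = ["paris", "munich"]
top1_seed_1 = ["lyon", "berlin"]

print(mean_reciprocal_rank(ranks_seed_0))        # 0.75
print(mean_reciprocal_rank(ranks_seed_1))        # 0.75
print(hits_at_k(ranks_seed_0, 1))                # 0.5
print(top1_agreement(top1_seed_0, top1_seed_1))  # 0.0
```

Both runs report the same MRR, yet they never agree on the top prediction, which is the gap between benchmark scores and prediction stability that the abstract describes.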

2606.03363 2026-06-03 cs.CL

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

EntSQL：一个将Text-to-SQL置于长上下文企业知识中的基准

Chengxi Liao, Tao Xu, Zulong Chen, Chuanfei Xu, Yiyan Wang, Xinyun Wang, Yanlong Zhang, Xiaojun Chen, Zhibo Yang, Zeyi Wen

发表机构 * HKUST (GZ)（香港科技大学（广州））； Alibaba Group（阿里巴巴集团）； Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)（广东人工智能与数字经济实验室（深圳））

AI总结提出EntSQL基准，通过包含1066个跨五个业务领域的中英文对齐示例，评估LLM在长上下文企业文档中基于私有业务知识生成SQL的能力，最佳系统仅达15.9%准确率。

AI中文摘要

Text-to-SQL使得通过自然语言访问数据库成为可能，最近的LLM显著提升了其能力。现有的基准如Spider、BIRD和Spider 2.0评估了模式泛化、大规模数据库和现实工作流，但很大程度上忽略了SQL生成依赖于私有业务知识（如内部指标、报告惯例和组织规则）的企业场景。我们引入了EntSQL，一个面向企业的Text-to-SQL基准，用于评估在专有业务文档上的长上下文基础。EntSQL包含1066个跨五个业务领域的中英文对齐语义示例，大多数示例需要超越问题和模式的领域知识，并涉及复杂的SQL结构。在英文输入上，当提供长文档时，最佳评估系统仅达到15.9%，突显了在企业知识基础上生成SQL的难度。

英文摘要

Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases, and realistic workflows, but largely overlook enterprise scenarios where SQL generation depends on private business knowledge, such as internal metrics, reporting conventions, and organizational rules. We introduce EntSQL, an enterprise-oriented Text-to-SQL benchmark for evaluating long-context grounding over proprietary business documents. EntSQL contains 1,066 aligned Chinese-English semantic examples across five business domains, with most examples requiring domain knowledge beyond the question and schema and involving complex SQL structures. On English inputs, the best evaluated system reaches only 15.9% when long-form documents are provided, highlighting the difficulty of grounding SQL generation in enterprise knowledge.

2606.03361 2026-06-03 cs.LG

Mitigating False Credit Propagation: Probabilistic Graphical Reward Aggregation for Rubric-Based Reinforcement Learning

缓解虚假信用传播：基于概率图奖励聚合的准则强化学习

Can Lv, Mingju Chen, Heng Chang, Shiji Zhou

发表机构 * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University（北京未来区块链与隐私计算先进创新中心，人工智能学院，北京航空航天大学）； Tsinghua University（清华大学）

AI总结针对准则奖励中因忽略准则间依赖关系导致的虚假信用传播问题，提出概率图框架Graphical Event Aggregation for Rubric rewards (GEAR)，通过建模潜在伯努利事件和软抑制传播实现依赖感知的奖励聚合，在多个基准上提升性能并减少信用泄漏。

AI中文摘要

基于准则的奖励越来越多地用于开放式语言模型的后训练，但准则级别的分数通常作为独立效用进行聚合。这种扁平标量化忽略了准则间由准则指定的前提和激活关系，使得即使触发奖励或惩罚的条件不存在，奖励或惩罚仍被计入。我们将这种结构性的奖励聚合失败称为虚假信用传播（FCP）。为解决这一局限，我们提出GEAR（Graphical Event Aggregation for Rubric rewards），一种用于依赖感知准则聚合的概率图框架。GEAR将每个准则结果建模为类型化准则图中的潜在伯努利事件，从不受支持的父事件向其子事件传播软抑制，并将结果事件概率聚合为归一化的期望符号效用。这产生了一个线性时间的奖励计算，可以插入到标准的基于准则的RL流程中，而无需改变外部优化算法。在HealthBench、WritingBench和PLawBench上使用两种策略骨干的实验表明，GEAR一致优于扁平聚合和确定性门控，相对于扁平聚合实现了高达15.5%的相对增益。FCP诊断进一步显示，相对于扁平聚合，GEAR减少了96.5%的泄漏，同时保留了比确定性门控更多的许可下游效用。我们的代码公开于 https://github.com/LvCan926/GEAR 。

英文摘要

Rubric-based rewards are increasingly used for open-ended language model post-training, but criterion-level scores are often aggregated as independent utilities. This flat scalarization ignores rubric-specified prerequisite and activation relations among criteria, allowing reward or penalty to be counted even when the condition that licenses it is absent. We call this structural reward-aggregation failure False Credit Propagation (FCP). To address this limitation, we propose GEAR (Graphical Event Aggregation for Rubric rewards), a probabilistic graphical framework for dependency-aware rubric aggregation. GEAR models each criterion outcome as a latent Bernoulli event in a typed rubric graph, propagates soft suppression from unsupported parent events to their children, and aggregates the resulting event probabilities into a normalized expected signed utility. This yields a linear-time reward computation that can be plugged into standard rubric-based RL pipelines without changing the outer optimization algorithm. Experiments on HealthBench, WritingBench, and PLawBench with two policy backbones show that GEAR consistently improves over flat aggregation and deterministic gating, achieving relative gains of up to 15.5% over flat aggregation. FCP diagnostics further show that GEAR reduces leakage by 96.5% relative to flat aggregation while preserving more licensed downstream utility than deterministic gating. Our code is publicly available at https://github.com/LvCan926/GEAR.
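The difference between flat aggregation and dependency-aware aggregation can be illustrated with a toy example. This is not the authors' formulation: the abstract does not give the propagation rule, so the sketch assumes the simplest form of soft suppression, in which a child's probability is multiplied by the effective probability of its single parent. The `Criterion` class, the function names, and the rubric entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Criterion:
    name: str
    prob: float                   # judged probability that the criterion is satisfied
    weight: float                 # signed utility: positive reward, negative penalty
    parent: Optional[str] = None  # prerequisite criterion, if any


def flat_reward(criteria: List[Criterion]) -> float:
    """Treat every criterion as an independent utility."""
    total = sum(abs(c.weight) for c in criteria)
    return sum(c.weight * c.prob for c in criteria) / total


def dependency_aware_reward(criteria: List[Criterion]) -> float:
    """Scale each child by its parent's effective probability (assumed suppression rule).

    Criteria must be listed with parents before children, so a single pass suffices.
    """
    effective: Dict[str, float] = {}
    for c in criteria:
        support = effective[c.parent] if c.parent is not None else 1.0
        effective[c.name] = c.prob * support
    total = sum(abs(c.weight) for c in criteria)
    return sum(c.weight * effective[c.name] for c in criteria) / total


rubric = [
    Criterion("cites_guideline", prob=0.0, weight=1.0),
    Criterion("guideline_dose_correct", prob=0.9, weight=1.0, parent="cites_guideline"),
]

print(flat_reward(rubric))              # 0.45: credit leaks to the child
print(dependency_aware_reward(rubric))  # 0.0: the unsupported parent suppresses it
```

In the flat case the response earns credit for a correct dose from a guideline it never cited. With the dependency taken into account, that credit disappears, which is the false credit propagation failure in miniature.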
