arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20777 2026-05-21 cs.CV

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

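Manogna Sreenivas, Rohit Kumar, Soma Biswas

Affiliations * Indian Institute of Science

AI summary This paper introduces the AttriStory benchmark for fine-grained attribute realization in visual storytelling, along with a latent optimization module that operates during early denoising steps. Its AttriLoss objective strengthens the alignment of attribute-object pairs so that attributes are localized more precisely.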

Comments Accepted at CVPR AIStory Workshop, 2026

Abstract

Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods keep a character consistent across scenes, they provide no systematic way to ensure that fine-grained attributes, such as the color and texture of clothing and accessories, are faithfully rendered in the generated images. Toward this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a large language model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, which is designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements from incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project page: https://manogna-s.github.io/attristory/

2605.20774 2026-05-21 cs.RO

VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

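Alex S. Huang, Jiahui Zhang, Shiqing Tang, Yu Xiang

Affiliations * Intelligent Robotics and Vision Lab, University of Texas at Dallas; Allen High School

AI summary This paper presents VLA-REPLICA, a low-cost, reproducible benchmark for real-world evaluation of vision-language-action models. Built from off-the-shelf components, it provides a consistent environment for policy evaluation and includes diverse manipulation tasks and a small-scale demonstration dataset for target-domain adaptation.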

Abstract

Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware or centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

2605.20771 2026-05-21 cs.LG

Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

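Kin Whye Chew, Jingxian Wang

Affiliations * Department of Computer Science; National University of Singapore

AI summary This paper proposes Cumulative Active Meta-Learning (CAML), a framework that uses actively queried samples to meta-learn a prior and thereby improve robustness to spurious correlations. Experiments show substantial gains on several benchmarks.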

Comments Under review. 26 pages, 7 figures

Abstract

Spurious correlations in real-world datasets cause machine learning models to rely on irrelevant patterns, undermining reliability, generalization, and fairness. Active learning offers a promising way to address this failure mode by querying informative samples that distinguish core features from spurious ones. However, standard active-learning methods simply append queried examples to the labeled set, effectively updating only the likelihood term. In deep learning regimes, the influence of these informative samples can be diluted by the larger labeled set and memorized by overparameterized models. We propose Cumulative Active Meta-Learning (CAML), an active-learning framework that uses queried examples to meta-learn the prior, or inductive bias, governing how the model adapts. CAML casts each active-learning round as a meta-learning task: the current labeled set serves as meta-train data for adaptation, while the newly queried batch serves as meta-test data for evaluating generalization. Unlike conventional meta-learning, which treats tasks as independent and identically distributed, CAML exploits the sequential dependence between active-learning rounds by maintaining a cumulative inductive bias that is progressively refined. Theoretically, we show that this cumulative formulation introduces interaction terms that couple earlier meta-learned inductive biases with later query-induced objectives, capturing dependencies absent from standard meta-learning. Empirically, CAML improves minority-group accuracy across spurious-correlation benchmarks and acquisition strategies, with gains of up to 27.8% on Dominoes, 29.9% on Waterbirds, 14.3% on SpuCo, and 24.0% on CivilComments.

2605.20767 2026-05-21 cs.CL cs.LG stat.ME

The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

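Victoria Lin, Taedong Yun, Maja Matarić, John Canny, Arthur Gretton, Alexander D'Amour

Affiliations * Google DeepMind; Carnegie Mellon University

AI summary This paper examines large language models as simulators of human behavior and shows that interventions on LLM-simulated synthetic users can induce unintended shifts in latent user attributes (user drift) that distort effect estimates. It proposes negative control outcomes to detect distribution shifts, and mitigates drift by adjusting persona specifications to reduce bias.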

Abstract

Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.
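
The negative-control diagnostic described above can be illustrated with a minimal sketch. It assumes a single numeric negative-control attribute (here a hypothetical simulated-user age) sampled under each condition. The permutation test on the difference in means, the function name `permutation_p_value`, and the example data are illustrative choices, not the authors' procedure. A small p-value would suggest that the implicit simulated population shifted between conditions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(control, treated, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    `control` and `treated` hold a negative-control attribute that the
    intervention should leave unchanged (hypothetical example: age).
    """
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical ages reported by simulated users in each condition.
control_ages = list(range(20, 40))
drifted_ages = list(range(40, 60))  # population shifted under treatment
stable_ages = list(range(20, 40))   # no shift

p_drift = permutation_p_value(control_ages, drifted_ages)
p_stable = permutation_p_value(control_ages, stable_ages)
```

In this toy setup `p_drift` is small (evidence of drift) while `p_stable` is large. In practice one would probably test several negative-control attributes and account for multiple comparisons.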

2605.20766 2026-05-21 cs.CV

Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

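Zhu Liu, Yuanhang Yao, Ping Qian, Zihang Chen, Risheng Liu

Affiliations * School of Software Technology, Dalian University of Technology, Dalian, China

AI summary This paper presents a more adaptive and stable framework for point-supervised infrared small-target detection. Exploiting the intrinsic consistency between thermal radiation patterns and heat diffusion, it proposes a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks, and a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters to strengthen supervision and alleviate sample imbalance.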

Abstract

Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at https://github.com/yuanhang-yao/diffuse-to-detect.

2605.20760 2026-05-21 cs.CV

SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation

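K S Nithurshen, Saurabh J. Shigwan

Affiliations * Shiv Nadar University

AI summary This paper presents SpineContextResUNet, an efficient 3D residual U-Net for rapid spine localization. Its lightweight context block reduces compute requirements without sacrificing performance, making the model suitable for resource-constrained environments.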

Comments 2 Figures, 3 Tables

Abstract

Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier to clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves Dice scores of 88.17% and 88.13%, respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach maintains high accuracy. Crucially, whereas heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.
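
For readers unfamiliar with the reported metric, the Dice score measures overlap between a predicted and a reference binary mask: twice the intersection divided by the sum of the two mask sizes. The sketch below is a minimal pure-Python version over flattened masks. The function name, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions for illustration, not the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Toy flattened masks (1 = vertebra voxel, 0 = background).
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A 3D volume can be scored the same way after flattening it to a single sequence of voxels.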

2605.20758 2026-05-21 cs.AI cs.CV cs.LG cs.RO

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

Affiliations * Smart Systems Institute, National University of Singapore, Singapore; Faculty of Computing, Harbin Institute of Technology, Harbin, China; School of Computing, National University of Singapore, Singapore

AI summary This paper proposes conflict-aware additive guidance for flow models under compositional rewards. The method corrects off-manifold drift by dynamically detecting and resolving gradient conflicts, improving generation fidelity.

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.
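
The abstract does not say how $g^\text{car}$ detects or resolves gradient conflicts, so the following is only a sketch of the general idea using a different, well-known technique. It flags a conflict when two guidance gradients have a negative inner product, and resolves it with a PCGrad-style projection borrowed from multi-task learning. All function names are hypothetical, and the learnable component of the actual method is not modeled.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(g, other):
    """Remove from g its component along `other` (assumes other is nonzero)."""
    scale = dot(g, other) / dot(other, other)
    return [x - scale * y for x, y in zip(g, other)]


def combine_guidance(g1, g2):
    """Add two guidance gradients, de-conflicting them first if they disagree."""
    if dot(g1, g2) < 0:  # conflict: gradients point in opposing directions
        # Both projections use the original vectors (tuple assignment).
        g1, g2 = project_out(g1, g2), project_out(g2, g1)
    return [x + y for x, y in zip(g1, g2)]


reward_a = [1.0, 0.0]
reward_b = [-1.0, 1.0]            # partially opposes reward_a
combined = combine_guidance(reward_a, reward_b)
aligned = combine_guidance([1.0, 0.0], [1.0, 1.0])  # no conflict: plain sum
```

After projection, the combined direction has a non-negative inner product with both original gradients, whereas the plain sum `[0.0, 1.0]` makes no progress on `reward_a`.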

2605.20756 2026-05-21 cs.LG cs.AI math.OC stat.ML

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

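Nikhil Nayak, Julia White, Urchade Zaratiana, Kelton Zhang, Henrijs Princis, Dhruv Atreja, Henry Fawcett, Matthew Thomas, George Hurn-Maloney, Ash Lewis

Affiliations * Fastino Labs

AI summary This paper studies finite-sample bias in the stochastic update rules of preconditioned optimizers and proposes a single-batch bias-correction framework. Cross-fitted preconditioning and variance-corrected inversion reduce gradient-preconditioner coupling bias and inversion bias, improving the performance of preconditioned optimizers.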

Comments 32 pages, 3 figures, 13 tables

Abstract

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
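
The inversion bias can be made concrete in the scalar case. For an unbiased estimate $\hat v$ with mean $v$ and variance $\sigma^2$, a second-order Taylor (delta-method) expansion gives $E[1/\hat v] \approx 1/v + \sigma^2/v^3$ and $E[\hat v^{-1/2}] \approx v^{-1/2} + \tfrac{3}{8}\sigma^2 v^{-5/2}$. The sketch below subtracts a plug-in estimate of that leading term, with $\sigma^2$ estimated from the spread of per-microbatch estimates. The scalar setting and the function names are assumptions. The paper's actual estimators for AdamW, Sophia, and Shampoo are not given in the abstract.

```python
def mean_and_variance_of_mean(microbatch_estimates):
    """Return the average estimate and the estimated variance of that average."""
    k = len(microbatch_estimates)
    if k < 2:
        raise ValueError("need at least two microbatches to estimate variance")
    m = sum(microbatch_estimates) / k
    sample_var = sum((x - m) ** 2 for x in microbatch_estimates) / (k - 1)
    return m, sample_var / k


def corrected_inverse(microbatch_estimates):
    """Plug-in 1/m minus the leading delta-method bias term var/m^3."""
    m, var_m = mean_and_variance_of_mean(microbatch_estimates)
    return 1.0 / m - var_m / m ** 3


def corrected_inverse_sqrt(microbatch_estimates):
    """Plug-in m^(-1/2) minus the leading bias term (3/8) * var * m^(-5/2)."""
    m, var_m = mean_and_variance_of_mean(microbatch_estimates)
    return m ** -0.5 - 0.375 * var_m * m ** -2.5


inv = corrected_inverse([1.0, 3.0])            # naive 1/2, corrected 0.375
inv_sqrt = corrected_inverse_sqrt([3.0, 5.0])  # naive 1/2, slightly lower
```

The correction is only a first-order fix: it is reasonable when the variance is small relative to the squared mean, and it can overshoot when the estimate is very noisy.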

2605.20751 2026-05-21 cs.LG cs.AI cs.SY eess.SY

PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

Canyu Lei, David Repaske, Jianxin Xie

Affiliations * University of Virginia, School of Data Science, Charlottesville, VA 22903, USA; University of Virginia, Department of Pediatrics, Charlottesville, VA 22903, USA

AI summary This study proposes PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control metrics from sparse, irregularly sampled SMBG data. Pseudo-SMBG samples guide learning and improve the model's accuracy and stability.

Abstract

Effective diabetes management requires continuous monitoring of glycemic levels. Clinically, glycemic control is assessed using metrics such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), typically derived from continuous glucose monitoring (CGM). However, many patients rely on self-monitoring of blood glucose (SMBG) due to the high cost and limited accessibility of CGM. Unlike CGM, SMBG provides sparse and irregular measurements, making accurate estimation of these metrics challenging. Conventional supervised learning approaches struggle under such sparsity, leading to poor generalization and unstable performance. To address this, we propose PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control from SMBG. Pseudo-SMBG samples with richer temporal coverage are used as teacher signals to guide learning from sparse observations. In addition, multi-view contrastive learning enforces representation consistency across diverse sampling patterns. The model adopts a hybrid Swin Transformer-CNN backbone to capture temporal dependencies in sparse SMBG sequences. Experimental results demonstrate that PACD-Net consistently outperforms existing methods in estimating TAR, TIR, and TBR from real-world SMBG data, achieving improved accuracy as well as enhanced stability and generalization under extremely sparse observation settings. The proposed framework provides a practical tool for clinical SMBG interpretation and offers a generalizable approach for learning from sparse and irregularly sampled sensor data in broader applications.
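
As a point of reference for the target metrics, TBR, TIR, and TAR can be computed directly when readings are dense and evenly spaced, as with CGM. The sketch below assumes the commonly used 70 to 180 mg/dL target range, which the abstract does not state, and treats the fraction of readings as the fraction of time. That second assumption is exactly what breaks down for sparse, irregular SMBG data, which is why a learned estimator is needed there.

```python
def glycemic_metrics(readings_mg_dl, low=70.0, high=180.0):
    """Return (TBR, TIR, TAR) as fractions of evenly spaced readings.

    `low` and `high` default to an assumed 70-180 mg/dL target range.
    """
    if not readings_mg_dl:
        raise ValueError("need at least one glucose reading")
    n = len(readings_mg_dl)
    below = sum(1 for g in readings_mg_dl if g < low)
    above = sum(1 for g in readings_mg_dl if g > high)
    in_range = n - below - above
    return below / n, in_range / n, above / n


# Toy example: four evenly spaced readings in mg/dL.
tbr, tir, tar = glycemic_metrics([60, 100, 150, 200])
```

The three fractions always sum to one, so any two of them determine the third.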

2605.20745 2026-05-21 cs.LG cs.AI cs.CL

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

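Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui

Affiliations * Dartmouth College; Datadog AI Research; Salesforce AI Research

AI summary This paper studies how to control verifier strictness through hidden-state intervention and proposes VerifySteer, which uses latent correctness signals for sample-level routing and intervenes selectively at paragraph boundaries. It outperforms baseline methods on ProcessBench and Hard2Verify while being more efficient in inference compute.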

Abstract

Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

2605.20744 2026-05-21 cs.LG cs.AI

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

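Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni

Affiliations * Tel Aviv University; Columbia University; Taso Labs

AI summary This paper proposes a new evaluation approach for measuring reward hacking: detectable reward-hacking opportunities are embedded directly in environments, making evaluation more reliable and automated. Using a TextArena-based testbed, it analyzes the reward-hacking behavior of different language models across diverse environments.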

Comments Project Page - https://majoroth.github.io/hack-verifiable-environments/

Abstract

Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in TextArena and release Hack-Verifiable TextArena, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack-verifiable-environments/.

2605.20743 2026-05-21 cs.CV cs.CL

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

Juncheng Hu, Jiawei Du, Xin Zhang, Joey Tianyi Zhou

Affiliations * National University of Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research; Institute of High Performance Computing, Agency for Science, Technology and Research

AI summary Draw2Think recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine, improving the accuracy and verifiability of geometric reasoning.

Abstract

Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/

2605.20742 2026-05-21 cs.AI

VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

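Joey Chan, Zhen Chen, Ershun Pan

Affiliations * Department of Industrial Engineering and Management, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

AI summary This study proposes a descriptive text modeling approach for battery signal reports to address the scarcity of open-source battery fault report datasets and the lack of a unified maintenance knowledge representation, building a language corpus for battery health diagnosis and maintenance. It also proposes VBFDD-Agent, which integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations.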

Abstract

With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support.

2605.20740 2026-05-21 cs.LG cs.AI cs.CL

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

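Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter

Affiliations * Georgia Institute of Technology; Allen Institute for AI

AI summary This paper proposes Distribution-Aware Reward, a reinforcement learning objective over predictive distributions that aims to improve the quality of a language model's predictive distribution on regression tasks rather than optimizing only individual decoded outputs. It evaluates the distribution formed by multiple decoded samples with the Continuous Ranked Probability Score and assigns credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are accurate and appropriately dispersed. Experiments show gains over supervised fine-tuning and pointwise reinforcement learning baselines across tasks, including a 6-point Spearman correlation improvement on KBSS.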

Comments 21 pages, 5 figures

英文摘要

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.
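The scoring idea in this abstract can be sketched as follows. This is a minimal illustration that assumes the standard sample-based CRPS estimator and a plain leave-one-out difference as the per-rollout credit; the function names `crps` and `leave_one_out_credit` are hypothetical, and the paper's actual reward shaping may differ.

```python
def crps(samples, target):
    """Sample-based CRPS estimate (lower is better).

    First term: mean absolute error of the samples against the target.
    Second term: half the mean pairwise distance between samples (spread).
    """
    n = len(samples)
    accuracy = sum(abs(x - target) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return accuracy - spread


def leave_one_out_credit(samples, target):
    """Assumed credit rule: how much the CRPS worsens when rollout i is removed.

    Positive credit means the rollout improves the empirical distribution.
    Requires at least two samples.
    """
    full = crps(samples, target)
    credits = []
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        credits.append(crps(rest, target) - full)
    return credits


if __name__ == "__main__":
    # Hypothetical rollouts decoded for one input whose true value is 1.0.
    print(leave_one_out_credit([1.0, 1.0, 5.0], 1.0))
```

With rollouts `[1.0, 1.0, 5.0]` and target `1.0`, the two accurate rollouts receive positive credit and the outlier at 5.0 receives negative credit under this rule.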

2605.20738 2026-05-21 cs.CV

STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

STAR-IOD: 面向遥感增量目标检测的尺度解耦拓扑对齐与伪标签细化

Yaoteng Zhang, Qing Zhou, Junyu Gao, Qi Wang

发表机构 * School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China(计算机科学学院,西北工业大学,西安710072,中国) School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China(人工智能学院,光学与电子学(iOPEN),西北工业大学,西安710072,中国)

AI总结 本文提出STAR-IOD框架,通过子空间解耦拓扑蒸馏模块和聚类驱动伪标签生成器,解决遥感增量目标检测中类别间拓扑关系对齐和尺度变化导致的表示差异问题,同时通过动态识别类别特定阈值来缓解旧类标注缺失问题,实验表明在DIOR-IOD和DOTA-IOD数据集上,方法在mAP上分别优于现有方法1.7%和2.1%。

Comments STAR-IOD was accepted by ISPRS Journal of Photogrammetry and Remote Sensing

AI中文摘要

遥感影像通常以连续数据流的形式出现。传统检测器在学习新类别时往往会遗忘之前学习的类别;因此,研究遥感增量目标检测(RS-IOD)具有重要意义。然而,现有方法大多忽视了遥感场景中普遍存在的类别内尺度变化,这削弱了知识迁移和旧知识保留的有效性。此外,RS-IOD还受到标注缺失的影响,导致模型将旧类实例误分类为背景。为了解决这些挑战,我们提出了一种新的框架STAR-IOD。首先,我们引入了子空间解耦拓扑蒸馏(STD)模块,以转移结构知识,显式对齐类别间拓扑关系,并缓解由尺度变化引起的类别内表示差异。此外,我们引入了聚类驱动伪标签生成器(CPG),这是一个即插即用模块,利用K-Means聚类动态识别类别特定阈值,从而保证真正阳性目标与背景噪声之间的准确区分,并缓解旧类标注缺失问题。我们还构建了两个遥感增量目标检测数据集,DIOR-IOD和DOTA-IOD,以促进RS-IOD的研究。广泛的实验表明,我们的方法在DIOR-IOD和DOTA-IOD数据集上分别以1.7%和2.1%的mAP优于现有方法,有效缓解了灾难性遗忘,同时在基础类和新类上保持了强劲的检测性能。代码和数据集已发布在:https://github.com/zyt95579/STAR-IOD。

英文摘要

Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby guaranteeing an accurate distinction between true positive targets and background noise and alleviating the issue of missing annotations for old classes. We also constructed two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: https://github.com/zyt95579/STAR-IOD.
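As an illustration of class-specific thresholds obtained by clustering, the sketch below runs a one-dimensional 2-means on the detection confidences of each class and takes the midpoint between the two centroids as that class's pseudo-label threshold. This is an assumption about how such a generator could work, not the CPG module itself; `two_means_threshold`, `class_thresholds`, and the class names are hypothetical.

```python
def two_means_threshold(scores, iters=50):
    """1-D K-Means with k=2; returns the midpoint between the two centroids."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return lo  # all scores identical, nothing to separate
    for _ in range(iters):
        mid = (lo + hi) / 2
        low = [s for s in scores if s <= mid]   # assumed background cluster
        high = [s for s in scores if s > mid]   # assumed true-positive cluster
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def class_thresholds(scores_by_class):
    """One threshold per class, computed from the old model's confidences."""
    return {cls: two_means_threshold(s) for cls, s in scores_by_class.items()}


if __name__ == "__main__":
    # Hypothetical confidences produced by the old detector on new-task images.
    print(class_thresholds({"ship": [0.05, 0.1, 0.15, 0.8, 0.9]}))
```

Detections scoring above the class threshold would then be kept as pseudo-labels for old classes, so each class gets its own cut-off rather than one global value.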

2605.20737 2026-05-21 cs.CV

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

通过语言先验解决无监督3D点云分割中的长尾歧义

Siqi Wei, Hongbin Xu, Feng Xiao, Tian Lan, Chun Li, Ming Li, Qiuxia Wu

发表机构 * South China University of Technology(华南理工大学) Bytedance(字节跳动) Tsinghua University(清华大学) Shenzhen MSU-BIT University(深圳北理莫斯科大学) Guangming Laboratory(光明实验室)

AI总结 本文提出LangTail框架,利用语言模型中的平衡世界知识来缓解无监督3D分割中的长尾歧义问题,通过建立语言衍生语义先验与视觉上不常见的小类之间的多级关联,提升小类的表示能力,实验表明在ScanNet-v2、S3DIS和nuScenes数据集上均取得显著提升。

Comments In submission. The code will be released at: https://github.com/Whisky0129/langtail_official

AI中文摘要

现有的无监督3D点云分割方法主要依赖于纯视觉相似性基于聚类的学习范式,这存在一个根本性限制:长尾歧义。在这样的范式中,次要类别的特征会被主导簇持续吸收,导致预测严重不平衡。为了解决这个问题,我们提出了LangTail,一种语言引导的分层学习框架,利用语言模型中编码的平衡世界知识来缓解无监督3D分割中的长尾歧义。关键思想是建立语言衍生语义先验与视觉上不常见的次要类别之间的多级关联,从而补偿纯粹视觉聚类对主导类别的偏关注。具体来说,LangTail首先从语言模型中构建实体级语义先验,捕捉跨类别的平衡和细粒度世界知识。这些先验通过对比对齐注入到分层聚类框架中。这引导多粒度语义结构的形成,并防止次要类别被主导簇吸收,从而为不常见的类别产生更具判别性的表示。在ScanNet-v2、S3DIS和nuScenes上进行的大量实验表明,LangTail在ScanNet-v2、S3DIS和nuScenes上分别比现有方法提高了+13.5、+12.9和+8.9 mIoU。这些结果证明了语言先验在提升3D点云中少数类别表示的有效性。代码将在:https://github.com/Whisky0129/langtail_official发布。

英文摘要

Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, i.e., +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.

2605.20733 2026-05-21 cs.CV

Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

Sketch2MinSurf: 通过视觉-语言引导从手绘草图生成可编辑的最小曲面

Wenda Wang, Anqi Liu, Junqi Yang, Lei He, Luying Wang, Jiachen Lu, Weixin Huang

发表机构 * School of Architecture, Tsinghua University(清华大学建筑学院) Department of Architecture, National University of Singapore(新加坡国立大学建筑系)

AI总结 本研究提出Sketch2MinSurf方法,结合视觉-语言引导和几何优化,从手绘草图生成平滑且可编辑的3D曲面,通过空间-拓扑编码和Sketch2MinSurf结构损失函数实现拓扑一致性与几何重建的联合约束。

Comments 22 pages, 16 figures, includes appendix

AI中文摘要

将手绘草图转换为结构化的3D几何体仍然具有挑战性,因为非欧几里得曲面的表示和拓扑一致性维护困难。现有的生成模型如GANs、NeRFs和扩散架构往往无法直接生成可编辑的流形用于下游设计流程。我们提出了Sketch2MinSurf,一种结合视觉-语言和几何优化的混合框架,通过将视觉-语言引导与最小曲面理论相结合,从手绘草图生成平滑且可编辑的3D曲面。我们的方法核心是一种空间-拓扑编码,将几何表示为节点坐标和实/虚拟边骨架的元组,使在生成过程中能够实现稳定的拓扑控制。我们进一步引入了Sketch2MinSurf结构损失函数(S2MS-Loss),一种奖励调制的目标,联合约束几何重建和拓扑一致性。在100个草图的测试集上,Sketch2MinSurf实现了0.844的拓扑相似度得分,优于现有的草图到形状基线。生成的流形可以直接编辑且没有非流形伪影。一所大学的公共艺术装置展示了该方法在人类意图驱动的3D形式生成中的潜力。数据集和代码可在https://anonymous.4open.science/r/Sketch2MinSurf/上获取。

英文摘要

Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.

2605.20732 2026-05-21 cs.CV

Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

深度注意力重加权:CNN中的后处理注意力特征聚合以解纠缠核心与伪相关特征

Kin Whye Chew, Jingxian Wang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种基于注意力的后处理特征聚合方法DAR,通过替换全局平均池化层来减少CNN中因伪相关特征引起的纠缠,从而提升模型的泛化能力和公平性。

Comments Under review. 26 pages, 7 figures

AI中文摘要

卷积神经网络(CNN)经常利用数据集中的伪相关性,学到表面上具有预测性但与因果无关的特征,导致泛化能力差和公平性问题。深度特征重加权(DFR)是一种后处理技术,通过在目标数据集上重新训练分类头来减少已训练模型对伪相关性的依赖。然而,我们发现DFR从根本上受限于它只能在纠缠的特征上操作,这限制了它在放大核心特征的同时抑制伪特征的能力。我们将这种纠缠追溯到普遍使用的全局平均池化(GAP)层,该层不加区分地将空间上相互分离的核心特征和伪特征压缩成单一表示。为了解决这个问题,我们提出深度注意力重加权(DAR),一种基于注意力的后处理特征聚合模块,它替换GAP层并与分类头一起重新训练。DAR对特征图上的各空间位置计算自适应权重,从而能在特征被压缩成纠缠表示之前有选择地抑制伪特征。在多种数据集、指标和消融实验中,DAR始终优于DFR,表明我们基于注意力的聚合方法减轻了GAP引起的纠缠,并减少了对伪相关性的依赖。

英文摘要

Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.
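The contrast between GAP and attention-based aggregation can be sketched in a few lines. This is a simplified illustration, assuming a single linear scoring vector `w` over channels followed by a softmax across spatial locations; DAR's actual architecture is not specified in the abstract, and `global_average_pool` and `attention_pool` are hypothetical names.

```python
import math


def global_average_pool(features):
    """features: list of spatial locations, each a list of C channel values."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def attention_pool(features, w):
    """Weighted average over spatial locations.

    Each location gets a score w . f, scores are softmax-normalised across
    locations, and the pooled vector is the weighted sum of location features.
    """
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    channels = len(features[0])
    return [sum(a * f[c] for a, f in zip(weights, features)) for c in range(channels)]


if __name__ == "__main__":
    # Two locations: one carrying a "core" channel, one a "spurious" channel.
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(global_average_pool(feats))        # both channels mixed equally
    print(attention_pool(feats, [10.0, 0.0]))  # mostly the first location
```

With a zero scoring vector the attention pool reduces to GAP, which is why a learned `w` can only add the ability to down-weight locations, such as those dominated by spurious features.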

2605.20730 2026-05-21 cs.CL cs.AI

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

分布对齐作为设计任务向量在上下文学习中的准则

Jihoon Kwon, Jiwon Choi, Jy-yong Sohn

发表机构 * Seoul National University(首尔国立大学) Yonsei University(延世大学)

AI总结 本文提出通过分布对齐来设计任务向量,引入了NTP距离作为衡量指标,并开发了线性任务向量方法以提升性能和效率。

Comments 9 pages, preprint

AI中文摘要

上下文学习(ICL)使大型语言模型(LLM)能够通过演示适应新任务,但推理成本会随上下文长度增加而上升。任务向量将演示压缩为紧凑的隐藏状态表示,是一种有前景的替代方案,但其质量此前只能通过下游任务准确率来评估。这一间接准则难以指导如何设计更有效的任务向量提取方法。本文认为,使用任务向量的推理应使其预测分布与ICL的预测分布对齐。为了量化这一点,我们引入$d_{\text{NTP}}$,一个衡量任务向量推理与ICL推理之间下一个词元(next-token)概率差异的指标。实证分析表明,$d_{\text{NTP}}$可作为性能代理,与下游准确率呈强负相关。受此启发,我们提出线性任务向量(LTV)方法,它通过闭式线性映射最小化$d_{\text{NTP}}$,并以回归方式估计演示的效果。在八个分类基准和五个LLM上,LTV始终优于现有任务向量基线,平均准确率提高9.2%,同时降低推理延迟。我们还表明LTV在回归任务上优于基线。此外,我们研究了LTV在不同模型规模之间的可迁移性,这在任务向量研究中仍处于起步阶段。具体而言,我们通过实验表明,来自较大模型的任务向量可将较小模型的性能提高6.4%,这为提取出的任务表示提供了一种新的用途。

英文摘要

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales, an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4%, suggesting a new utility for extracted task representations.
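One way to make a next-token discrepancy such as $d_{\text{NTP}}$ concrete is sketched below. The abstract does not give the exact formula, so this assumes a KL divergence from the ICL next-token distribution to the task-vector one, averaged over queries; `kl_divergence` and `d_ntp` are hypothetical names and the probabilities are made up.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def d_ntp(icl_probs, task_vector_probs):
    """Assumed discrepancy: mean KL between ICL and task-vector next-token
    distributions, one pair of distributions per evaluation query."""
    pairs = list(zip(icl_probs, task_vector_probs))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 2-token vocabulary, two queries (illustrative numbers only).
    icl = [[0.9, 0.1], [0.2, 0.8]]
    close = [[0.85, 0.15], [0.25, 0.75]]
    far = [[0.5, 0.5], [0.5, 0.5]]
    print(d_ntp(icl, close), d_ntp(icl, far))
```

A task vector whose predictions track ICL closely yields a small value, which is the property the abstract reports as correlating negatively with downstream accuracy.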

2605.20729 2026-05-21 cs.CL

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

MTR-Suite: 一个用于评估和合成对话检索基准的框架

Junhao Ruan, Abudukeyumu Abudula, Bei Li, Yongjing Yin, Xinyu Liu, Kechen Jiao, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, Jingbo Zhu

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) Meituan Inc.(美团公司) NiuTrans Research(牛译研所) Tsinghua University(清华大学)

AI总结 本文提出MTR-Suite框架,通过LLM审计、多智能体系统生成和通用领域基准,解决对话检索基准评估和合成中的成本高、标注稀疏和自动化方法僵化的问题。

Comments Accepted to ACL 2026 (main conference). 28 pages. Code and data: https://github.com/rangehow/mtr-suite

AI中文摘要

准确评估对话检索对于推进检索增强生成(RAG)系统至关重要。然而,现有的对话检索基准存在成本高、标注稀疏或自动化方法僵化、不自然的问题。为了解决这些挑战,我们引入MTR-Suite,一个统一的框架,用于审计、合成和基准测试检索。它具有三个特点:(1)MTR-Eval,一个基于LLM的审计器,用于量化先前基准中的对齐差距;(2)MTR-Pipeline,一个使用贪心遍历聚类的多智能体系统,能够以1/400的成本生成高保真对话;(3)MTR-Bench,一个严谨的通用领域基准。MTR-Bench模拟生产式挑战(如困难的主题切换、冗长),提供更强大的判别能力。我们公开了代码和数据,以促进未来研究,网址为https://github.com/rangehow/mtr-suite.

英文摘要

Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.

2605.20728 2026-05-21 cs.CV

Early High-Frequency Injection for Geometry-Sensitive OOD Detection

早期高频注入用于几何敏感的分布外检测

Chuanjie Cheng, Ningkang Peng, Chenxi Liu, Yifan He, Peirong Ma, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

AI总结 本文通过分频带分析揭示了高频输入对几何敏感的分布外(OOD)检测的重要性,提出EIHF方法,在CIFAR-100和ImageNet-100上提升了检测性能,同时揭示了其在以场景为中心的Places偏移上的局限性。

AI中文摘要

事后分布外(OOD)检测器在训练完成后对logits或特征进行评分,因此其成败取决于表示中已经编码的几何结构。我们通过对CE、SimCLR、SupCon以及面向OOD的表示方法PALM进行分频带MMD^2分析,重新审视这一假设。在我们的诊断中,低频输入频带诱导的ID/OOD特征差异较弱,而较高频带往往带来更强的可分性。这一观察促使我们提出EIHF,一种输入侧干预方法,在第一层卷积之前显式提供高频证据,且不改变训练目标。EIHF在几何敏感的OOD检测上效果最强:在训练和评分设置一致的条件下,它重塑了类条件特征几何,并减小了ID/OOD马氏(Mahalanobis)得分的重叠。在CIFAR-100和ImageNet-100上的实验显示,EIHF在CIFAR-100上取得提升,在ImageNet-100上取得最佳的平均FPR95和次佳的平均AUROC,同时也揭示了其在以场景为中心的Places偏移上的局限性。代码见 https://anonymous.4open.science/r/EIHF 。

英文摘要

Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at https://anonymous.4open.science/r/EIHF.

2605.20727 2026-05-21 cs.CV

GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

GAMR: 带虚拟异常合成的几何感知流形正则化用于噪声标签学习

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Peirong Ma, Xichen Yang, Weiguang Qu, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学) Nanjing University of Chinese Medicine(南京中医药大学)

AI总结 本文提出了一种几何感知流形正则化方法,通过主动合成虚拟异常样本来重构特征空间几何,从而提升在噪声标签下的学习性能,其核心贡献是增强模型对难样本和噪声样本的区分能力,实现更鲁棒的表示学习。

AI中文摘要

深度神经网络(DNNs)在处理噪声标签时会遭受显著的性能下降,主要由于过度拟合错误标记的数据。当前主流方法试图通过在训练过程中被动过滤干净样本来缓解这一问题。然而,在受噪声破坏的特征空间中,简单的样本过滤难以区分具有挑战性的样本和噪声样本,从而成为模型性能的瓶颈。我们首次强调了主动重塑特征空间几何在学习噪声数据中的根本重要性。我们提出了一种新颖的几何感知流形正则化范式,其核心思想是通过主动合成虚拟异常样本来显式构建数据流形之间的能量屏障。通过施加促进类内紧凑性和类间分离的几何约束,该方法增强了难样本与噪声样本之间的可区分性,从而学习到更鲁棒的表示。我们的正则化机制具有高度的通用性,其有效性不依赖于任何关于噪声模式的先验假设。它可以作为独立机制集成到现有的样本选择框架中,提供更强的鲁棒性以应对多样的噪声环境。实验表明,我们的范式在多个基准上,包括CIFAR-10,均实现了超越当前最先进(SOTA)方法的性能,特别是在更具挑战性的不对称噪声条件下表现尤为突出。此外,该范式显著增强了模型在Out-of-Distribution(OOD)检测方面的能力,确保了在开放世界场景中更高的可靠性和安全性。

英文摘要

Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model's capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.

2605.20725 2026-05-21 cs.CV

Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label

整体可靠性传播:解耦标注与预测以实现鲁棒的噪声标签

Jingyang Mao, Ningkang Peng, Yanhui Gu

发表机构 * Nanjing Normal University(南京师范大学)

AI总结 本文提出了一种整体可靠性传播方法,通过解耦标注和预测来提高在噪声标签下的鲁棒性,该方法通过双层元学习生成两个批次标准化标量,分别用于给定标签和伪标签,并在不同目标上路由这些可靠性,从而在合成和现实基准上提升了平均准确率。

AI中文摘要

在多媒体分类中,使用噪声标签学习时通常将外部注释和模型预测合并为一个可靠性权重,尽管这两个来源可能因不同的原因失效。我们相反地估计解耦的可靠性:双层元学习为每个样本生成两个批次标准化标量,alpha用于给定标签,beta用于伪标签,而不将它们限制为总和为一。整体可靠性传播(HRP)然后将它们路由到不同的目标,使用可靠性感知的Mixup和全局门控在输入分支上,以及beta门控的伪标签正例在对比分支上。在合成和现实世界基准上,HRP在强基线之上提高了平均准确率,并在最高噪声率下保持竞争力。

英文摘要

Learning with noisy labels in multimedia classification often combines external annotations and model predictions into a single reliability weight, even though the two sources can fail for different reasons. We instead estimate disentangled reliabilities: bilevel meta-learning produces two batch-normalized scalars per sample, alpha for the given label and beta for the pseudo-label, without constraining them to sum to one. Holistic Reliability Propagation (HRP) then routes them to different objectives, using reliability-aware Mixup with global gating on the input branch and beta-gated pseudo-label positives on the contrastive branch. On synthetic and real-world benchmarks, HRP improves average accuracy over strong baselines and remains competitive at the highest noise rates.

2605.20723 2026-05-21 cs.LG

Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds

在资源受限的Android设备群上实现内存高效的分区DNN推理

Lakshani Manamperi, Disumi Pathirana, Thiwanka Pathirana, Nipun Premarathna, Kutila Gunasekera

发表机构 * Department of Computer Science and Engineering, University of Moratuwa, Moratuwa, Sri Lanka(计算机科学与工程系,莫图瓦大学,莫图瓦,斯里兰卡)

AI总结 本文提出了一种在资源受限的Android设备上实现高效DNN推理的方法,通过五个机制将内存压力分散到多个设备上,从而在不修改模型的情况下实现ONNX推理,显著降低了电池消耗和延迟。

Comments 6 pages, 3 figures, 4 tables. Accepted at the ICML 2026 Workshop on Machine Learning for the Global South

AI中文摘要

在边缘机器学习中,将大型深度神经网络部署到内存受限的移动设备上是一个核心挑战。尽管压缩、剪枝和量化可以降低每个参数的成本,但基于Transformer的模型对于RAM约为3.3-7.4 GB的商用Android手机来说仍然过大。我们提出了CROWDio的DNN流水线调度子系统,它通过五种机制将内存压力分散到多台设备上,从而无需修改模型即可在资源受限的Android工作节点上实现实用的ONNX推理。这五种机制是:JIT延迟分区加载、单分区驻留约束、四层亲和性调度器、zlib压缩张量传输以及流式1:1依赖模型。我们在五部Android手机上用DistilBERT(Sanh等人,2019;约6700万参数,SST-2)进行了十次运行的评估:系统将每台设备的峰值RSS控制在43±2 MB,将每次运行的电池消耗限制在50±3 mAh,同时流式并发使批次延迟比屏障同步低34%。

英文摘要

Deploying large deep neural networks on memory-constrained mobile devices is a central challenge in edge ML. While compression, pruning, and quantization reduce per-parameter cost, transformer-based models remain too large for the 3.3-7.4 GB RAM envelope of commodity Android handsets. We present the DNN pipeline scheduling subsystem of CROWDio, which achieves practical ONNX inference across resource-constrained Android workers without model modification, by distributing memory pressure across devices via five mechanisms: JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, a zlib-compressed tensor transport, and a streaming 1:1 dependency model. Evaluated on DistilBERT (Sanh et al., 2019) (approximately 67 M parameters, SST-2) across five Android handsets over ten runs, our system holds peak per-device RSS to 43 ± 2 MB and limits battery draw to 50 ± 3 mAh per run, while streaming concurrency cuts batch latency 34% below barrier synchronisation.
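Of the five mechanisms, the zlib-compressed tensor transport is the easiest to illustrate. The sketch below serialises an intermediate activation as little-endian float32 with a small shape header and compresses it with Python's standard `zlib` module. The wire format, and the names `pack_tensor` and `unpack_tensor`, are assumptions made for illustration; CROWDio's actual format is not described in the abstract.

```python
import struct
import zlib


def pack_tensor(values, shape):
    """Serialise a flat list of floats plus its shape, then zlib-compress.

    Assumed layout: uint32 ndim, ndim x uint32 dims, then float32 payload.
    """
    header = struct.pack("<I", len(shape)) + struct.pack(f"<{len(shape)}I", *shape)
    body = struct.pack(f"<{len(values)}f", *values)
    return zlib.compress(header + body)


def unpack_tensor(blob):
    """Inverse of pack_tensor: returns (flat list of floats, shape tuple)."""
    raw = zlib.decompress(blob)
    (ndim,) = struct.unpack_from("<I", raw, 0)
    shape = struct.unpack_from(f"<{ndim}I", raw, 4)
    count = 1
    for dim in shape:
        count *= dim
    values = struct.unpack_from(f"<{count}f", raw, 4 + 4 * ndim)
    return list(values), tuple(shape)


if __name__ == "__main__":
    # A sparse activation compresses well, which reduces bytes sent between devices.
    blob = pack_tensor([0.0] * 1000, (10, 100))
    print(len(blob), "bytes on the wire vs", 4 * 1000, "raw")
```

The receiving worker calls `unpack_tensor` and feeds the result to the next model partition; how much compression helps depends on how repetitive the activation values are.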

2605.20722 2026-05-21 cs.LG cs.AI

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

AGPO: 基于双统计反馈的自适应群体策略优化

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 本文提出AGPO,一种无 critic 的 GRPO 改进方法,通过群体层面的统计信息控制更新幅度和探索。在九个英语和中文数学/STEM 基准上,Qwen2.5-14B 在相同生成 token 预算下优于 PPO/GRPO,达到 GSM8K 67.3% 和 MATH 40.5%。

AI中文摘要

强化学习提升大语言模型推理能力,但 PPO/GRPO 通常使用固定剪切和解码温度,使训练脆弱且调参困难。我们提出自适应群体策略优化(AGPO),一种无 critic 的 GRPO 改进方法,利用群体层面统计信息控制更新幅度和探索。AGPO 使用共享的探针衍生统计状态驱动两个控制器:(i)自适应剪切,根据奖励分散度和偏度、探针投票熵、策略熵和逐步 KL 偏移设置信任区域大小;(ii)双向自适应温度采样,根据与运行基线相对的中心不确定性加热或冷却解码。在九个英语和中文数学/STEM 基准上,使用 AGPO 训练的 Qwen2.5-14B 在相同生成 token 预算下优于 PPO/GRPO,达到 GSM8K 67.3% 和 MATH 40.5%。收益转移到 Llama-3-8B 和 Gemma-2-9B,消融实验确认两个模块互补。我们的实现可在 https://github.com/wandugu/paper_agpo 公开获取。

英文摘要

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

2605.20721 2026-05-21 cs.LG

Robust Recommendation from Noisy Implicit Feedback: A GMM-Weighted Bayes-label Transition Matrix Framework

从噪声隐式反馈中进行鲁棒推荐:一种GMM加权贝叶斯标签转移矩阵框架

Zongyu Li, Xuanyu Liu, Gongce Cao, Shirui Sun, Yaqi Fang, Yongshuai Yu

发表机构 * Guangdong University of Technology(广东工业大学) University of Chinese Academy of Sciences(中国科学院大学) Capital Normal University(首都师范大学) Xiamen University(厦门大学) Beijing Jiaotong University(北京交通大学)

AI总结 本文提出了一种鲁棒的高斯混合模型加权贝叶斯标签转移矩阵框架(RGBT),通过利用高斯混合模型生成实例特定的可靠性评分,系统校准贝叶斯标签转移矩阵估计以减少偏差,从而在保证全样本利用的同时,实现一致的估计和显著的估计方差减少。

AI中文摘要

在推荐系统中,从隐式反馈中学习面临普遍存在的标签噪声这一根本挑战。传统去噪方法通常丢弃噪声实例以保证鲁棒性,但这种策略不可避免地导致数据利用率低。另一类方法采用贝叶斯标签转移矩阵(BLTM),可以利用全部可用数据,但其估计在实际推荐场景中往往存在偏差。为了解决这些局限,本文提出一种鲁棒的GMM加权贝叶斯标签转移矩阵框架(RGBT)。我们的方案利用高斯混合模型(GMM)得到实例级的可靠性评分,并据此系统地校准BLTM估计以减轻偏差。理论分析表明,在BLTM框架中引入GMM校准后,我们的方法同时保证了全样本利用和一致估计,更重要的是显著降低了估计方差。在多个真实数据集和合成翻转数据集上的大量实验表明,RGBT不仅比主流的基于可靠样本的去噪方法更有效地利用噪声样本,而且与最先进的基于转移矩阵的去噪方法相比,其转移矩阵校准能力也显著更优。

英文摘要

Learning from implicit feedback in recommender systems is fundamentally challenged by pervasive label noise. While conventional denoising approaches often discard noisy instances to ensure robustness, this strategy inevitably suffers from low data utilization. Alternative methods that employ a Bayes-label transition matrix (BLTM) can leverage all available data, but their estimates tend to be biased in practical recommendation scenarios. To address these limitations, this paper proposes a Robust GMM-weighted Bayes-label Transition Matrix framework (RGBT). Our solution utilizes a Gaussian Mixture Model (GMM) to derive instance-specific reliability scores, which systematically calibrate the BLTM estimation to mitigate bias. Theoretical analysis confirms that our approach, by leveraging the BLTM framework with GMM calibration, simultaneously ensures full sample utilization, delivers consistent estimation, and critically, achieves a significant reduction in estimation variance. Extensive experiments on multiple real-world and synthetically flipped datasets demonstrate that RGBT not only utilizes noisy samples more effectively than mainstream reliable sample-based denoising methods, but also achieves significantly superior calibration capability of the transition matrix compared to state-of-the-art transition matrix-based denoising approaches.

2605.20713 2026-05-21 cs.CV cs.AI cs.LG

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

SAVER:选择性所需视觉证据用于多模态信息提取

Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) University of Chinese Academy of Sciences(中国科学院大学) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 该研究提出SAVER框架,通过选择性视觉证据提升多模态命名实体识别和关系抽取的性能,减少计算开销并提高准确性。

AI中文摘要

社交媒体中的多模态信息抽取之所以困难,是因为一条帖子可能附带多张与文本弱相关、冗余甚至具有误导性的图像。在这种情况下,始终开启的多模态融合既浪费计算,又可能放大虚假的视觉线索。核心挑战在于:对每个候选跨度或已标记的实体对,判断是否需要参考视觉信息;如果需要,又应选择哪一小部分图像作为可信证据。我们提出SAVER,一种按需选择性使用视觉信息的框架,用于多模态命名实体识别(MNER)和多模态关系抽取(MRE)。SAVER使用共形可落地性门控(Conformal Groundability Gate,CGG)在MNER中估计跨度级的视觉可落地性,在MRE中由两个已标记实体推导实体对级的激活,并在留出数据划分上通过带Clopper-Pearson上界的共形式程序校准激活阈值。门控激活后,一个子模的相关性-多样性选择器跨图像选出紧凑的证据子集,再由Set Transformer进行聚合。一个受能量模型启发的联合评分头综合文本、可选的视觉证据、文本-图像一致性以及稀疏路由,用于实体类型判定或关系分类。实验表明,SAVER相对于强大的纯文本基线和始终开启的多模态基线持续提升F1,同时降低AURC,在固定风险水平下提高激活覆盖率,并降低FLOPs和P90延迟。

英文摘要

Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper-Pearson upper bounds. When activated, a submodular relevance-diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text-image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.
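The Clopper-Pearson upper bound used in the calibration step can be computed exactly from the binomial CDF. The sketch below finds the one-sided upper bound by bisection using only the standard library, then applies it as a safety check on a candidate threshold. The check itself (`threshold_is_safe`) and the example counts are assumptions for illustration and are not taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, alpha=0.05):
    """One-sided (1 - alpha) upper confidence bound on an error rate,
    given k errors observed in n held-out trials."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(60):  # bisection: the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def threshold_is_safe(errors, trials, risk_level, alpha=0.05):
    """Assumed rule: accept a threshold only if the upper bound on its
    held-out error rate stays at or below the target risk level."""
    return clopper_pearson_upper(errors, trials, alpha) <= risk_level


if __name__ == "__main__":
    # Hypothetical held-out counts for one candidate activation threshold.
    print(clopper_pearson_upper(5, 100), threshold_is_safe(5, 100, 0.10))
```

Using the upper bound rather than the raw error rate makes the gate conservative when the held-out split is small.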

2605.20712 2026-05-21 cs.CL cs.AI

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

SCRIBE:用于印度语言ASR的诊断评估和丰富转录模型

Kavya Manohar, Arghya Bhattacharya, Kush Juvekar, Kumarmanas Nethil

发表机构 * Adalat AI, India(印度Adalat人工智能公司)

AI总结 SCRIBE通过容忍连音(sandhi)的对齐和领域词汇注入,将词错误率分解为按类别划分的错误率,弥补了传统词错误率在处理黏着语时的不足,同时发布了面向印地语、马拉雅拉姆语和卡纳达语的开放权重富转录模型。

Comments Submitted to Interspeech 2026

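Details
AI abstract (translated from Chinese)

Automatic speech recognition replaces typing only when correcting its output costs less than manual entry, a threshold set by the types of errors rather than their number: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails in two ways: it merges distinct error categories into a single scalar, and it structurally penalizes agglutinative languages, where valid sandhi merges inflate the score. We introduce SCRIBE, a diagnostic framework that decomposes errors into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment and domain vocabulary injection. Human validation confirms that SCRIBE agrees with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.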

English abstract

Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
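To illustrate why a single WER scalar hides error types, here is a minimal sketch that aligns reference and hypothesis words with a standard Levenshtein alignment and then buckets each error as lexical, punctuation, or numeral. It is not SCRIBE's algorithm: it has no sandhi tolerance and no domain vocabulary, and the function names and bucketing rules are assumptions made for illustration.

```python
import unicodedata


def align(ref, hyp):
    """Word-level Levenshtein alignment.
    Returns (ref_word, hyp_word) pairs; None marks an insertion or deletion."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right cell
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            ops.append((None, hyp[j - 1]))  # insertion
            j -= 1
    return ops[::-1]


def category(word):
    """Illustrative bucketing rule based on Unicode character classes."""
    if all(unicodedata.category(ch).startswith("P") for ch in word):
        return "punctuation"
    if any(ch.isdigit() for ch in word):
        return "numeral"
    return "lexical"


def error_breakdown(ref, hyp):
    """Return (WER, per-category error counts) for two token lists."""
    counts = {"lexical": 0, "punctuation": 0, "numeral": 0}
    for r, h in align(ref, hyp):
        if r != h:
            counts[category(r if r is not None else h)] += 1
    return sum(counts.values()) / max(len(ref), 1), counts


wer, counts = error_breakdown("the fee is 500 rupees ,".split(),
                              "the fee is 50 rupees".split())
```

In this example the hypothesis contains one numeral error and one punctuation error. Both count equally toward WER, although the abstract argues that their correction costs differ.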

2605.20696 2026-05-21 cs.LG

Distributed Direct Preference Optimization

Distributed Direct Preference Optimization (title translated from Chinese)

Zhanhong Jiang

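Affiliations * Translational AI Center, Iowa State University, Ames, USA

AI summary This paper studies the convergence and time complexity of Direct Preference Optimization (DPO) in distributed environments, analyzes how fragmented preference data affects optimization dynamics in federated and decentralized learning, and proposes robust, scalable methods with theoretical guarantees.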

Comments 29 pages, 12 figures

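Details
AI abstract (translated from Chinese)

Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior remains unclear in distributed settings where preference data are fragmented across heterogeneous users. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees in federated and decentralized training, where communication constraints and non-IID preferences fundamentally change the optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. By modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the effects of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we validate our theoretical insights on standard alignment benchmarks, showing that the proposed methods have strong theoretical guarantees and are also robust and scalable in practice. The code base is available here.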

English abstract

Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we corroborate our theoretical insights on standard alignment benchmarks, demonstrating that our proposed methods not only enjoy strong theoretical guarantees but also deliver robust and scalable performance in practice. The code base is available here.
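The abstract does not give the distributed algorithms themselves. As a point of reference, the sketch below shows the standard per-pair DPO loss together with a FedAvg-style weighted aggregation of client parameters. Treating FedAvg as the aggregation rule is an assumption here, not a claim about the paper's method, and the function names and the `beta` default are illustrative.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy
    Returns -log(sigmoid(margin)), computed in a numerically stable form.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def fed_avg(client_params, client_sizes):
    """FedAvg-style aggregation (assumed here): average client parameter
    vectors weighted by each client's number of preference pairs."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[k] * n for p, n in zip(client_params, client_sizes)) / total
        for k in range(dim)
    ]


# Hypothetical round: two clients with different data volumes.
global_params = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

With heterogeneous (non-IID) client preferences, each client's local minimizer of `dpo_loss` can differ from the global one, which is the client-drift effect the abstract says its convergence rates quantify.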

2605.20693 2026-05-21 cs.CL cs.AI stat.ML

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

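Interpretable Discriminative Text Representations via Agreement and Label Disentanglement (title translated from Chinese)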

Tong Wang, Yiqing Xu, Leo Yang Yang

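Affiliations * Yale University; Stanford University; Hong Kong Baptist University

AI summary This paper proposes an interpretable discriminative text representation that uses agreement and label disentanglement to make features interpretable and reproducible. Experiments show that the method matches a strong baseline on several text-classification tasks while producing clearer, less label-entangled features.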

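Details
AI abstract (translated from Chinese)

Interpretable text representations should expose coordinates that are not only predictive but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that the definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We implement this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates with cross-LLM Cohen's $κ$, and selects features by residual held-out predictive gain. A stylized analysis links the $κ$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human-human and human-LLM agreement than baseline concepts, and raters consistently judge them to leak less label information. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.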

English abstract

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $κ$, and selects features by residual held-out predictive gain. A stylized analysis connects the $κ$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human-human and human-LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.
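The screening step relies on Cohen's $κ$, which corrects raw agreement for the agreement expected by chance. The sketch below computes $κ$ for two annotators and applies it as a filter over candidate features. The `screen` helper and the 0.6 cut-off are assumptions chosen for illustration; the abstract does not state the threshold LFD uses.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences from two annotators."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels independently according to their own marginals.
    p_chance = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


def screen(candidates, min_kappa=0.6):
    """Keep feature names whose two annotation runs agree at kappa >= min_kappa.

    candidates: dict mapping a feature name to a pair of label lists,
    for example the outputs of two different LLM annotators.
    """
    return [
        name for name, (a, b) in candidates.items()
        if cohens_kappa(a, b) >= min_kappa
    ]


# Hypothetical annotations of two candidate features on eight texts.
candidates = {
    "mentions_refund": ([1, 0, 1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 1, 1, 0, 0]),
    "sounds_negative": ([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 0, 1]),
}
kept = screen(candidates)
```

Here the second feature has 75% raw agreement but a $κ$ of only 0.5, because half of that agreement is expected by chance, so it is screened out under the assumed cut-off.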