arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.27295 2026-05-27 cs.CV

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Gemini Embedding 2：来自Gemini的原生多模态嵌入模型

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini

AI总结提出原生多模态嵌入模型Gemini Embedding 2，通过多任务多阶段对比学习统一视频、音频、图像和文本的表示空间，在单模态、跨模态和多模态检索任务上达到最先进性能。

详情

AI中文摘要

我们介绍了Gemini Embedding 2，一种原生多模态嵌入模型，允许在统一表示空间中对视频、音频、图像和文本模态进行嵌入。我们利用Gemini的多模态能力，为所有这些模态的交错输入任意组合生成嵌入，这些嵌入在广泛的任务中具有良好的泛化能力。在多任务多阶段训练设置中应用大规模对比学习，我们在关键嵌入基准测试中取得了最先进的性能，包括涵盖多种任务的单模态、跨模态和多模态检索。我们展示了我们的嵌入模型在多种任务上表现出强大的性能（在MSCOCO上得分为62.9 R@1，在Vatex上为68.8 NDCG@10，在MTEB多语言上为69.9，在MTEB代码上为84.0），超越了专门模型的性能。这些统一的能力使Gemini Embedding 2成为下游用例（如RAG、推荐和搜索）的有前途的候选者。此外，它在不同领域（从天文学和生物科学到美术和烹饪艺术）的强大零样本性能，使其成为即使对于专业领域也非常可靠的即用型表示。

英文摘要

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields - from astronomy and bioscience to fine arts and the culinary arts - establishes it as a highly reliable, out-of-the-box representation even for specialized domains.

URL PDF HTML ☆

赞 0 踩 0

2605.27294 2026-05-27 cs.CL cs.IR

Separating Semantic Competition from Context Length in RAG Reading

在RAG阅读中区分语义竞争与上下文长度

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar, Cien Zhang, Svetlana Karslioglu, Akash Vishwakarma

AI总结通过匹配对照实验，分离出检索增强生成中阅读器的语义竞争效应，证明性能下降部分源于竞争而非仅上下文长度。

Comments 4 pages, 1 figure, 2 tables

详情

AI中文摘要

检索增强生成（RAG）系统即使在检索到正确段落时也可能错误回答。模型仍需阅读检索到的段落，并在看似相关的段落中识别出包含答案的那一个。这种段落阅读模型称为阅读器。它的失败仅仅是因为上下文更长，还是因为其他段落与正确段落真正竞争？我们引入并展示了一种RAG阅读的匹配对照协议：保持段落数量和长度固定，但将强竞争段落替换为不那么竞争的实段。我们在SQuAD上对两个紧凑开放模型应用此对照。这种替换部分恢复了性能，对F1和答案包含的影响最强。对于Phi-2，它恢复了+6.0 EM点、+7.0答案包含点和+0.057 F1。对于Qwen2.5-1.5B，它恢复了+4.5 EM点、+9.0答案包含点和+0.068 F1。为了跟踪性能如何随竞争段落积累而变化，我们还报告了保留曲线，并在曲线未交叉半保留时用右删失半衰期进行总结。这些结果共同表明，该协议分离了与上下文长度不同的竞争效应，尽管该效应对F1和答案包含比精确匹配更清晰，并且也随片段长度变化。

英文摘要

Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.

URL PDF HTML ☆

赞 0 踩 0

2605.27293 2026-05-27 cs.LG stat.ML

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

BASIS: 基于单次采样信息共享的批量优势估计用于LLM推理

Shijin Gong, Erhan Xu, Kai Ye, Francesco Quinzan, Giulia Livieri, Chengchun Shi

AI总结提出BASIS算法，通过单次采样和批次内信息共享改进价值函数估计，在减少计算开销的同时提升策略优化性能。

Comments 17 pages, 7 figures

详情

AI中文摘要

基于可验证奖励的强化学习已成为提升大型语言模型推理能力的标准方法。现有算法在价值估计和策略学习中面临计算效率与样本效率之间的权衡。我们引入BASIS，一种无评论家的后训练算法，旨在解决这一权衡。在每个在线训练步骤中，BASIS每个提示仅采样一次，但利用整个批次中跨提示的丰富信息来改进价值函数估计。实验表明，与代表性单次采样基线REINFORCE++相比，BASIS将价值函数估计的MSE降低了69%，并且使用一次采样达到的MSE低于使用8次采样的组均值估计器。价值估计的改进转化为更好的策略优化：使用显著更少的训练时间，BASIS达到了接近多次采样GRPO型基线的性能，并且通常优于单次采样REINFORCE型基线。

英文摘要

Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.27288 2026-05-27 cs.CL cs.AI cs.LG

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

并非总是谄媚：基于认知不确定性测量LLM的从众行为

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

AI总结本文提出MUSE框架，通过区分谄媚从众和不确定性驱动的从众，揭示LLM在用户反驳时改变立场的行为机制，并发现两种从众均随用户感知专业性和建议合理性增强。

详情

AI中文摘要

大型语言模型（LLMs）已知会放弃初始立场以适应用户的反驳。虽然先前研究主要将此行为归因于从人类反馈强化学习中习得的谄媚，但我们假设从众行为也受模型在推理时的认知不确定性驱动。本文提出MUSE，一个两阶段评估框架，用于解开驱动LLM从众行为的机制。具体而言，MUSE将模型回答查询时的认知不确定性与其在后续轮次中屈服于用户反驳的可能性进行映射。我们证明驱动从众的机制不仅限于谄媚。具体来说，我们刻画了共同驱动从众的两个不同因素：谄媚从众，即模型即使对其初始回答绝对确定也会与用户反驳保持一致；以及不确定性驱动从众，即模型从众可能性随其不确定性增加而增加。此外，我们进行消融研究，证明谄媚从众和不确定性驱动从众均随1）LLM对用户感知专业性的增加和2）用户建议的合理性增加而增长。更广泛地说，MUSE通过区分对齐诱导的谄媚和训练语料驱动的不确定性，为更有针对性的干预策略提供信息。

英文摘要

Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.

URL PDF HTML ☆

赞 0 踩 0

2605.27287 2026-05-27 cs.CV

A Dynamic Programming Framework for Discovering Count and Values of Multilevel Image Thresholding

一种用于发现多级图像阈值计数和值的动态规划框架

Eslam Hegazy, Mohamed Gabr

AI总结提出一种基于动态规划和改进最小误差阈值准则的自动多级阈值方法，能自动确定阈值数量，在速度上优于传统动态规划方法，但SSIM和PSNR略低。

详情

AI中文摘要

多级图像阈值化是当今计算机视觉应用中重要的预处理算法。由于大多数常见的阈值化方法将期望的阈值数量作为用户输入，因此能够从输入图像本身自动确定合适阈值数量的阈值化方法具有优势。本文详细介绍了一种基于动态规划算法和改进的最小误差阈值（MET）准则的新型阈值化方法。通过实证统计研究，指出了该方法为何更优。此外，在自然、卫星和医学测试图像的综合集合上，将该方法与其它最先进方法进行了扩展比较。数值结果表明，当阈值数量较高时，所提出的MET-DP方法比传统的动态规划阈值化方法耗时少得多。该方法能够为大多数不同类型的测试图像检测出合适的阈值数量。然而，以阈值数量作为输入的传统方法产生的阈值化图像在结构相似性指数（SSIM）和峰值信噪比（PSNR）值上高于MET-DP。源代码可在https://w3id.org/met-dp/article1-code找到。

英文摘要

Multilevel Image thresholding is an important preprocessing algorithm in computer vision applications nowadays. Since most common thresholding methods take the desired count of thresholds as input by the user, thresholding methods that automatically determines a suitable count of thresholds from the input image itself are advantageous. In this article, a novel thresholding method based on a dynamic programming algorithm and a modification of Minimum Error Thresholding (MET) criterion is thoroughly presented. An empirical statistical study is performed to pinpoint why this proposed method is superior. Moreover, an extended comparison between this proposed method and other state-of-the-art methods is performed on a comprehensive set of natural, satellite and medical test images. The numerical results show that the proposed MET-DP method takes much less time than traditional dynamic programming thresholding methods when the number of thresholds is high. The proposed method can detect a suitable count of thresholds for most of tested images of different types. However, traditional methods that take the count of thresholds as input produce thresholded images of higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) values than MET-DP. Source code can be found on https://w3id.org/met-dp/article1-code

URL PDF HTML ☆

赞 0 踩 0

2605.27281 2026-05-27 cs.LG stat.ML

Causal Risk Minimization for High-Dimensional Treatments

高维处理变量的因果风险最小化

Nikita Dhawan, Arnav Paruthi, Andrew Kim, Lovedeep Gondara, Jekaterina Novikova, Chris J. Maddison

AI总结针对高维处理空间（如文本）的因果推断，提出通过分解因果误差为矩平衡误差序列并优化高阶平衡目标，以及将高维处理投影到低维属性的方法，实现无需属性特定训练的因果估计。

Comments 18 pages, 4 figures

详情

AI中文摘要

预测具有多种可能变化的干预效果（例如，影响心理健康结果的治疗内容或推动股价变动的财报电话会议记录）在多个领域中非常有用。然而，经典的因果估计量通常假设所有可能的干预都被观察到，这在干预变化广泛的情况下（例如，在所有文本字符串的空间中）是不可行的。我们采用了一种将因果推断重新表述为学习问题的著名方法，以处理高维处理空间。具体来说，在标准假设（如无未观测混杂）下，我们证明因果误差可分解为一系列递增阶数的矩平衡误差，并设计了直接改进因果估计的目标函数。我们还展示了如何将高维处理的效果投影到低维处理属性上，这使得单个模型能够回答多个因果问题，而无需额外的属性特定训练。我们在高维连续、离散和文本处理设置中经验性地评估了我们的估计量，其中文本处理使用了亚马逊评论的半合成数据集。我们的实验证明了高阶平衡误差优化的优势以及投影因果估计与属性特定估计的竞争性能。

英文摘要

Predicting the effect of interventions with many possible variations, e.g., therapeutic content that affects mental health outcomes or an earnings call transcript that drives movement in share price, is useful across several domains. However, classical causal estimators tend to assume that all possible interventions are observed, which is infeasible when interventions vary widely, for instance, in the space of all text strings. We adapt a well-known approach of recasting causal inference as a learning problem, to address high-dimensional treatment spaces. Specifically, under standard assumptions like no unobserved confounding, we show that causal error decomposes into a series of moment-balancing errors of increasing order, and design objectives that directly improve causal estimation. We also show how to project the effect of a high-dimensional treatment onto lower-dimensional treatment attributes, which allows a single model to answer several causal questions without additional attribute-specific training. We empirically evaluate our estimators in settings with high-dimensional continuous, discrete, and text treatments, the last of which used a semi-synthetic dataset of Amazon Reviews. Our experiments demonstrate the benefit of higher-order balance error optimization and competitive performance of projected causal estimates with attribute-specific estimators.

URL PDF HTML ☆

赞 0 踩 0

2605.27269 2026-05-27 cs.LG stat.AP

Transfer Learning using 66 Diseases for Disease Forecasting Applications

使用66种疾病的迁移学习进行疾病预测应用

Lauren J Beesley, Alexander C Murph, Dave Osthus, Lauren A Castro

AI总结本研究通过迁移学习整合66种传染病及多种数据流，发现大多数情况下加入其他数据流能提升预测性能，但数据质量至关重要，并构建了公开数据库。

详情

AI中文摘要

疾病预测模型通常依赖于单一数据流，这使得模型在历史数据短或噪声大时变得脆弱。最近表现最佳的模型表明，综合同一疾病的多个报告系统可以提升性能。其他近期工作进一步扩展了这一想法，使用迁移学习利用不同疾病的数据来训练某一疾病的预测模型。我们极大地扩展了这些方法，在涵盖66种传染病和多个数据流的数据上训练机器学习模型。我们研究了整合不同数据流对预测20种不同疾病数据流的价值。我们发现，在绝大多数（84.9%）考虑的时间序列和模型结构中，整合其他数据流改善了预测。然而，我们的工作强调，添加数据的质量很重要，添加与目标数据流极其不同的数据有时会降低预测性能。这项工作的一个主要贡献是编制了一个公开可用的数据库，供传染病预测社区使用。

英文摘要

Disease forecasting models typically rely on a single data stream, making models brittle when histories are short or noisy. Recent top-performing models have shown that synthesizing multiple reporting systems for the same disease improves performance. Other recent work takes this idea a step further, using transfer learning to train a forecasting model for one disease using data from a different disease. We expand upon each of these approaches greatly, training machine learning models on data that span 66 infectious diseases and several data streams. We investigate the value of incorporating different data streams for forecasting 20 different disease data streams. We find that incorporating other data streams improves forecasting in the vast majority (84.9%) of time series and model structures considered. However, our work highlights that the quality of the added data matters, where adding data extremely different from the target data stream can sometimes degrade forecast performance. A major contribution of this work is in compiling a publicly-available database of data for use by the infectious disease forecasting community.

URL PDF HTML ☆

赞 0 踩 0

2605.27268 2026-05-27 cs.CL cs.AI

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

迷失在采样中：通过词覆盖率评估大语言模型中的词汇可达性

Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu, Javier Coronado-Blázquez, Pedro Reviriego

AI总结提出词覆盖率（WCS）指标，量化标准采样过滤器（如Top-p、Top-k、Min-p）如何抑制低频率高信息词汇的生存率，揭示解码机制对语言多样性的影响。

Comments 15 pages, 6 figures

详情

AI中文摘要

现代大语言模型（LLM）常因生成重复和同质化文本而受到批评，尽管它们拥有庞大的潜在词汇量。以往研究关注模型知识和训练数据，我们则探究解码机制在抑制语言多样性中的作用。我们引入词覆盖率（WCS），该指标量化了标准采样过滤器（如Top-$p$、Top-$k$和Min-$p$）在数学上剔除上下文适当的人类词汇的程度。WCS并非评估静态知识，而是衡量低频率、高信息人类词汇的词汇存活率作为采样参数的函数。通过审计人类撰写的语料片段中的开放权重模型，我们识别出哪些合理的词汇选择因解码器而变得不可达，即使它们存在于概率空间中。我们的结果提供了定量证据，表明行业标准的采样默认值充当了无意的审查机制，将人类表达的独特纹理平滑为同质化的话语。WCS为优化文本连贯性与词汇丰富性之间的权衡提供了严谨框架，为在生成模型中保留人类语言多样性提供了诊断工具。

英文摘要

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.

URL PDF HTML ☆

赞 0 踩 0

2605.27259 2026-05-27 cs.LG

Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning

Kan扩展变换器：注意力、扩散和预测-分离自条件的范畴统一

Sridhar Mahadevan

AI总结提出Kan扩展变换器（KETs）作为多种Transformer实现的统一范畴框架，将Transformer层视为加权结构化扩展算子，并通过预测-分离机制实现有效的自条件化，实验表明预测-分离机制比改变邻域族带来更大性能提升。

Comments 30 pages

详情

AI中文摘要

我们提出Kan扩展变换器（KETs）作为多种Transformer实现的统一范畴框架。核心主张是，Transformer层可以被视为加权结构化扩展算子：标准注意力是单邻域情况，几何Transformer风格的关联混合是稀疏边限制情况，而KET是高阶单纯形情况。这一视角也阐明了与扩散式补全的桥梁。当扩展算子作用于分离的预测载体而非教师强制隐藏状态时，它成为一种有效的自条件化机制，在不泄露未来黄金令牌的情况下暴露非因果结构。我们在Penn Treebank、WikiText-2和WikiText-103上对12种不同的Transformer实现进行了全面的实验验证，这些实现在严格因果和预测-分离机制上有所不同。在严格因果设置中，二次KET是WikiText-2和WikiText-103上比较的因果架构中最强的模型。然而，在所有数据集上，最大的收益来自预测-分离机制，而非仅改变邻域族。

英文摘要

We propose Kan Extension Transformers (KETs) as a unifying categorical framework for a diverse group of Transformer implementations. The core claim is that a Transformer layer can be viewed as a weighted structured extension operator: standard attention is the singleton-neighborhood case, Geometric Transformer style incidence mixing is a sparse edge-restricted case, and KET is the higher-order simplicial case. This lens also clarifies a bridge to diffusion-style completion. When the extension operator acts on detached predictive carriers instead of teacher-forced hidden states, it becomes a valid self-conditioning mechanism that exposes noncausal structure without leaking gold future tokens. We include a comprehensive experimental validation of 12 different Transformer implementations varying across strict-causal and predict-detach regimes on Penn Treebank, WikiText-2, and WikiText-103. In the strict-causal setting, quadratic KET is the strongest model among the compared causal architectures on WikiText-2 and WikiText-103. Across all datasets, however, the largest gains come from the predict-detach regime rather than from changing the neighborhood family alone.

URL PDF HTML ☆

赞 0 踩 0

2605.27254 2026-05-27 cs.LG cs.AI

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

LUCoS: 表格基础模型的潜在无监督上下文选择

Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero

AI总结针对表格基础模型在低标签场景下的上下文选择问题，提出LUCoS方法，利用无监督先验拟合网络（PFN）的潜在几何结构选择代表性medoids作为上下文，在67个数据集上优于随机选择和原始空间方法。

Comments Comments: 18 pages, 4 figures, supplementary appendices included

详情

AI中文摘要

选择哪些实例进行标注是低标签表格学习中的一个关键挑战。对于最近的表格基础模型（如TabPFN），上下文选择直接决定预测性能。有监督的oracle实验表明，在相同标注预算下，精心选择的标注上下文集可以显著优于随机选择。然而，在TFM文献中，冷启动设置（即必须在任何标签可用之前选择实例）很少受到关注。这个问题本质上是几何问题。在视觉和语言领域，基础模型诱导出嵌入空间，其中简单的几何选择方法是有效的。相比之下，表格实例选择迄今为止主要是在原始表格空间中进行，而该空间缺乏自然的度量；异构类型、混合尺度以及非线性交互使得原始空间距离对于上下文构建不可靠，并且随着预算增加，原始空间选择在大多数数据集上表现低于随机。我们提出LUCoS（潜在无监督上下文选择），该方法用无监督先验拟合网络（PFN）诱导的潜在几何替换原始特征几何，并选择代表性medoids作为上下文。在67个OpenML-CC18数据集上，跨六个低标签预算评估，LUCoS在平均AUC、ACC和F1上排名第一，结论在指标和数据集级别的稳健性检查中保持稳定。增益分解揭示了一个简单机制：在最小预算下，主要收益来自强制覆盖；随着预算增加，决定性因素变为衡量覆盖的表示空间。LUCoS缓解了原始特征空间选择的失败，表明可靠的无监督上下文选择更少依赖于选择器的复杂性，而更多依赖于在有意义的表示几何中定义代表性。

英文摘要

Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, context selection directly determines predictive performance. Supervised oracle experiments show that carefully chosen labeled context sets can strongly outperform random selection under the same labeling budget. However, the cold-start setting, where instances must be selected before any labels are available, has received little attention in the TFM literature. This problem is fundamentally geometric. In vision and language, foundation models induce embedding spaces where simple geometric selection methods are effective. In contrast, tabular instance selection has so far been performed predominantly in the original tabular space, which lacks a natural metric; heterogeneous types, mixed scales, and nonlinear interactions make raw-space distances unreliable for context construction, and original-space selection falls below random on the majority of datasets as the budget grows. We propose LUCoS (Latent Unsupervised Context Selection), which replaces raw-feature geometry with the latent geometry induced by embeddings from an unsupervised Prior-Fitted Network (PFN) and selects representative medoids as context. Evaluated on 67 OpenML-CC18 datasets across six low-label budgets, LUCoS ranks first under mean AUC, ACC, and F1, with conclusions stable across metrics and dataset-level robustness checks. A gain decomposition reveals a simple mechanism: at the smallest budgets, the main benefit comes from enforcing coverage; as the budget increases, the decisive factor becomes the representation space in which coverage is measured. LUCoS mitigates failures of original feature space selection, showing that reliable unsupervised context selection depends less on selector sophistication than on defining representativeness in a meaningful representation geometry.

URL PDF HTML ☆

赞 0 踩 0

2605.27249 2026-05-27 cs.AI cs.CL

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Gumbel机器：通过Gumbel噪声引导生成反事实学生写作

Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara, Andrew Lan

AI总结提出Gumbel机器，一种利用β-Hindsight控制解码算法生成既符合评分标准又与学生原文相似的反事实文本的模块化方法。

Comments preprint

详情

AI中文摘要

跨学科教学的有效方法是提供高质量工作的示例。然而，示例可能与学生的当前工作存在显著差异，使得学生难以模仿。理想的学习示范是学生工作的反事实版本，即与学生自身工作相似但有所改进的版本。现有的使用大型语言模型（LLMs）进行反事实文本生成的自动化方法导致了难以转化为实际应用的领域特定系统。我们提出了Gumbel机器，一种灵活、模块化的反事实生成方法，它利用LLM的指令遵循能力，同时鼓励与参考事实文本的相似性。我们方法的核心是一种新颖的受控解码算法β-Hindsight控制，该算法在反事实生成过程中利用潜在随机性作为可调的相似性控制机制。在根据各种标准评分的学生写作数据集上的实验表明，我们的方法在生成既符合评分标准又与参考文本相似的反事实文本方面是有效的。

英文摘要

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $β$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.

URL PDF HTML ☆

赞 0 踩 0

2605.27245 2026-05-27 cs.LG

Symbolic Regression via Latent Iterative Refinement

通过潜在迭代细化的符号回归

Xieting Chu, Sriram Vishwanath, Vijay Ganesh

AI总结提出潜在方程嵌入（LEE）框架，通过迭代推断在功能基础化的潜在空间中缩小符号回归的推断差距，生成更简单且准确的表达式。

Comments Preprint. 21 pages, 11 figures

详情

AI中文摘要

符号回归（SR）旨在寻找拟合观测数据的封闭形式数学表达式。神经SR方法通过训练编码器将观测数据直接映射到表达式来摊销搜索，但这种摊销推断在其一次性预测与真实后验之间留下了残余的摊销差距。我们提出潜在方程嵌入（LEE），这是一个通过在功能基础化的潜在空间中进行迭代摊销推断来缩小这一差距的框架。LEE学习一个共享的潜在空间Z，配备三个组件：编码器f_theta，将符号标记和数值观测联合嵌入到单个潜在向量z中；表达式解码器g_expr，从z重建公式；以及评估解码器g_eval，从z预测函数值，明确地将潜在空间基于功能行为。在推断时，LEE通过将解码后的表达式与观测数据联合重新编码来执行迭代细化，逐步改进潜在估计。LEE将编码器本身用作学习到的推断优化器：每个重新编码步骤隐式计算候选与数据之间的不匹配。由于g_eval在z上是可微的，我们另外将连续梯度下降与离散重新编码交错进行，产生一个混合迭代和梯度细化过程。在SRBench上，跨三个噪声水平，针对涵盖遗传规划、符号-神经混合和预训练Transformer的19个基线，LEE生成的表达式比最强精度导向的基线（包括Operon、GP-GOMEA、TPSR、RAG-SR和GenSR）简单2-10倍，复杂度为8-11，而后者为20-90。这些结果推进了精度-复杂度帕累托前沿的低复杂度区域，并显示出随着噪声增加而优雅退化。

英文摘要

Symbolic regression (SR) seeks closed-form mathematical expressions that fit observed data. Neural SR methods amortize the search by training an encoder to map observations directly to expressions in a single pass, but this amortized inference leaves a residual amortization gap between its one-shot prediction and the true posterior. We propose Latent Equation Embedding (LEE), a framework that closes this gap through iterative amortized inference in a functionally grounded latent space. LEE learns a shared latent space Z equipped with three components: an encoder f_theta that jointly embeds symbolic tokens and numerical observations into a single latent vector z; an expression decoder g_expr that reconstructs formulas from z; and an evaluation decoder g_eval that predicts function values from z, explicitly grounding the latent space in functional behavior. At inference, LEE performs iterative refinement by re-encoding decoded expressions jointly with observations, progressively improving the latent estimate. LEE uses the encoder itself as a learned inference optimizer: each re-encoding step implicitly computes the mismatch between the candidate and the data. Because g_eval is differentiable in z, we additionally interleave continuous gradient descent with discrete re-encoding, yielding a hybrid iterative and gradient refinement procedure. On SRBench across three noise levels, against 19 baselines spanning genetic programming, symbolic-neural hybrids, and pre-trained Transformers, LEE produces expressions 2--10x simpler than the strongest accuracy-oriented baselines, including Operon, GP-GOMEA, TPSR, RAG-SR, and GenSR, with complexity 8--11 versus 20--90. These results advance the low-complexity region of the accuracy-complexity Pareto frontier and show graceful degradation as noise increases.

URL PDF HTML ☆

赞 0 踩 0

2605.27243 2026-05-27 cs.CV

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

检索头能看见图像吗？长上下文视觉语言模型中的多模态检索头

Aaron Branson Cigres Li, Zhaowei Wang, Yu Zhao, Yiming Du, Haobo Li, Xiyu Ren, Ginny Wong, Simon See, Lishu Luo, Haodong Duan, Pasquale Minervini, Yangqiu Song

AI总结本文提出一种多模态检索头检测方法，发现视觉语言模型中仅有4.4-10.2%的注意力头贡献了50%的正检索分数，这些头对长上下文推理至关重要，且可直接用于文档检索提升性能。

Comments Work in Progress

详情

AI中文摘要

大型视觉语言模型越来越依赖长上下文建模来推理文档、小时级视频和长周期智能体轨迹，要求它们能在交错的文本和图像中定位相关证据。先前的工作使用大语言模型中的检索头研究了这种行为，但其基于复制的标准在证据出现在图像中时并不直接适用。我们引入了一种多模态检索头检测方法，对从问题标记到文本或视觉证据的注意力进行评分。通过这种方法，我们表明多模态检索头是稀疏的、内在的且因果重要的：仅4.4-10.2%的注意力头贡献了50%的正检索分数，而屏蔽前5%选定的头会使MMLongBench-Doc从48.2%降至5.7%，SlideVQA从71.2%降至8.9%，而随机头屏蔽的破坏性要小得多。进一步分析表明，这些头在模态间部分共享，但在每个模态内保持动态，随着上下文长度和“草堆”模态的变化，图像检索头比文本检索头变化更大。无需进一步训练，我们发现这些头也可直接用于对视觉丰富文档进行排序：在MMDocIR上，Qwen3-VL-8B选定的头评分在页面检索上比最强基线提高了7.7/7.4宏/微平均Recall@1，在布局检索上提高了6.3/6.8点。

英文摘要

Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.

URL PDF HTML ☆

赞 0 踩 0

2605.27240 2026-05-27 cs.CL

ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents

ENPMR-Bench: 情感支持代理的主动记忆检索基准

Xing Fu, Yulin Hu, Mengtong Ji, Haozhen Li, Yixin Sun, Weixiang Zhao, Yanyan Zhao, Bing Qin

AI总结提出ENPMR-Bench基准，基于马斯洛需求层次评估情感支持代理主动推断用户潜在情感需求并检索适当记忆的能力，实验表明当前检索范式存在显著缺陷。

详情

AI中文摘要

记忆增强的语言代理越来越多地部署在情感支持等情感应用中，在这些应用中，理解和响应用户的潜在情感需求至关重要。然而，现有研究通常将记忆视为事实检索的工具，忽视了其在塑造用户情感体验中的作用。在这项工作中，我们引入了ENPMR-Bench，一个用于评估情感需求感知的主动记忆检索（ENPMR）的基准，这是一种核心能力，使代理能够推断用户的潜在情感需求并主动检索适当的记忆以支持共情交互。基于马斯洛需求层次，ENPMR-Bench包括超过1,800个记忆增强对话，并定义了情感需求与支持性记忆类型之间的结构化映射。实验结果表明，当前的检索范式，包括基于嵌入和LLM驱动的方法，都存在显著缺陷，共情得分明显落后于黄金记忆条件。虽然思维链提示在一定程度上改善了推断的情感需求与检索记忆之间的一致性，但性能差距仍然显著。总之，这些发现揭示了当前代理的关键局限性，并指出了通过需求敏感的记忆检索推进个性化情感支持的方向。

英文摘要

Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users' latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users' emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users' latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow's hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.

URL PDF HTML ☆

赞 0 踩 0

2605.27239 2026-05-27 cs.CL

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

时间同步性预测情感语料库中的标注质量

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, Botshelo Kondowe, Letlhogonolo Mohleleng, Hareaipha Nkopo Letsoalo, Shamsuddeen Hassan Muhammad, Vukosi Marivate

AI总结通过分析Setswana情感数据集，发现标注者间一致性随时间下降的主要原因是时间同步性，即同时标注的样本一致性高，而间隔较长的标注一致性低。

详情

AI中文摘要

当标注活动跨越数周或数月且标注者池较小时，标注质量难以维持。我们提出了一个Setswana情感数据集，包含3,565条推文，由三名母语标注者在八个批次中标注，并考察了标注者间一致性（IAA）随时间下降的原因。尽管总体Randolph自由边际Kappa为$κ= 0.76$，属于“优秀”，但每批次$κ$在整个标注任务中下降了超过32个百分点。通过六项针对性分析，我们发现：(i) 标签混淆集中在负面/中性边界；(ii) 两名标注者表现出与自动驾驶标注一致的运行长度漂移；(iii) $κ$的主要预测因子是时间同步性：一分钟内标注的推文达到$κ= 0.98$，而相隔超过一天标注的推文仅达到$κ= 0.65$。标注速度和推文级语言特征与$κ$无显著关联。我们评估了三种开放多语言编码器和专有模型（GPT-5和Gemini）在三类情感分类任务上的表现；微调相比预训练基线提升了29到43个宏F1分数，其中GPT-5少样本学习总体领先（62.2宏F1）。我们发布了数据集、每条标注的时间戳和分析代码，以支持未来非洲语言NLP资源的可重复质量审计。

英文摘要

Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph's free-marginal Kappa of $κ= 0.76$, "excellent," per-batch $κ$ falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $κ$ is temporal simultaneity: tweets labeled within one minute achieve $κ= 0.98$, while those labeled more than a day apart reach only $κ= 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $κ$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.

URL PDF HTML ☆

赞 0 踩 0

2605.27236 2026-05-27 cs.LG physics.ao-ph

Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening

基于特征和深度学习模型用于TROPOMI甲烷羽流筛选的可解释比较

Solomiia Kurchaba, Joannes D. Maasakkers, Berend J. Schuit, Ilse Aben

AI总结本研究比较了基于特征（SVC、随机森林、XGBoost）和基于图像（ResNet-18、ResNet-34）的模型在甲烷羽流-伪影分类中的性能，并通过SHAP可解释性分析为操作筛选提供指导。

详情

AI中文摘要

连续且全球性地检测大量甲烷排放是全球变暖减缓的关键步骤。卫星观测（例如来自S5P/TROPOMI）结合羽流检测算法可以在这一努力中发挥关键作用。然而，并非所有看起来像甲烷排放羽流的TROPOMI羽流检测都是实际排放的结果。数据中相当一部分类似羽流的特征是检索伪影。此类伪影可能是由海拔或反照率梯度变化、高浓度气溶胶、海岸线、水体等引起的。先前的工作通过支持向量机分类器（SVC）解决了羽流-伪影分类问题，该分类器在由领域专家设计的大量基于观测的标量特征上训练。然而，这种方法将算法接收的信息范围限制在专家认为重要的内容上，破坏了像素之间的空间关系，并在统计聚合过程中丢失信息。在本研究中，我们在平衡和不平衡评估设置下比较了基于特征（SVC、随机森林、XGBoost）和基于图像（ResNet-18、ResNet-34）的模型用于甲烷羽流-伪影分类。为了解释结果，我们将基于SHAP的可解释性应用于两个模型家族。我们的发现为操作甲烷筛选工作流程（如CAMS甲烷热点探索器）中的模型选择提供了实用指导。

英文摘要

Continuous and global detection of large methane emissions is a crucial step for global warming mitigation. Satellite observations, such as from S5P/TROPOMI, combined with plume detection algorithms, can play a key role in this effort. However, not all TROPOMI plume detections that look like methane emission plumes are the result of actual emissions. A significant part of the plume-like features in the data are retrieval artifacts. Such artifacts could be the result of variations in elevation or albedo gradients, high concentrations of aerosols, coastal lines, water bodies, etc. Previous work approached the problem of plume-artifact classification by means of a Support Vector Machine Classifier (SVC), trained on an extensive set of observation-based scalar features designed by domain experts. However, such an approach limits the information scope received by the algorithm to what is deemed to be important by the experts, breaks the spatial relationship between pixels, and loses information during the process of statistical aggregation. In this study, we compare feature-based (SVC, Random Forest, XGBoost) and image-based (ResNet-18, ResNet-34) models for methane plume-artifact classification under balanced and imbalanced evaluation settings. To interpret the results, we apply SHAP-based explainability to both model families. Our findings provide practical guidance for model selection in operational methane-screening workflows such as the CAMS Methane Hotspot Explorer.

URL PDF HTML ☆

赞 0 踩 0

2605.27235 2026-05-27 cs.CV

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

MRT：用于大规模分层图像生成与编辑的掩码区域变换器

Zhicong Tang, Zhao Zhang, Jingye Chen, Mohan Zhou, Yifan Pu, Yuchi Liu, Yalong Bai, Ethan Smith, Yuhui Yuan

AI总结提出MRT，一个200亿参数的掩码区域扩散模型，通过统一文本到层、图像到层和层到层任务，并引入溢出感知画布层，实现高效的多层透明图像生成与编辑。

Comments CVPR 2026

详情

AI中文摘要

分层图像生成与编辑是一项基础能力，能够实现生成视觉内容的逐层重用、编辑和组合，类似于自然语言中的词级编辑。尽管其重要性，但在大规模场景下仍是一个未充分探索的领域。为解决这一问题，我们提出了MRT，一个200亿参数的掩码区域扩散模型，专为多层透明图像生成与编辑设计，并在超过1000万个涵盖多种宽高比和文本提示的多语言设计样本上训练。为充分利用这一规模，我们做出了两项关键技术贡献。首先，我们在共享的掩码区域扩散框架内统一了三个互补任务，包括文本到层、图像到层和层到层，其中选择性标记掩码实现了灵活的逐层生成与编辑。其次，为实现溢出层生成，我们引入了一个溢出感知画布层，用于处理边界不一致性并支持半透明背景合成，从而生成超出可见画布边界的完整可编辑层。此外，我们应用扩散蒸馏实现了8步实时多层生成，且质量下降极小。大量实验表明，我们的框架在所有三个任务上显著优于先前的最先进方法（包括各种商业系统），为多层透明图像生成建立了新基准。值得注意的是，根据用户研究结果，我们的模型在图像到层质量上显著优于同期Qwen-Image-Layered模型，同时在图像到层推理中实现了10-100倍的推理速度提升，并将激活GPU内存消耗降低50-90%。

英文摘要

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100\times faster inference and reducing activation GPU memory consumption by 50-90\% during image-to-layer inference.

URL PDF HTML ☆

赞 0 踩 0

2605.27220 2026-05-27 cs.CL cs.IR

The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System

覆盖幻觉：从检索前路由失败到生产RAG系统中的检索后级联

Zafar Hussain, Kristoffer Nielbo

AI总结本文通过丹麦国家百科全书的案例研究，发现合成查询高估了LLM增强的需求（覆盖幻觉），并提出一种检索后级联策略，按成本递增顺序执行工作流，仅在无结果时升级到LLM增强，从而在无需训练开销的情况下提升质量并降低延迟。

详情

AI中文摘要

在现代RAG流水线中，HyDE和查询扩展等查询增强方法被应用于每个查询，导致大量的LLM推理成本和端到端延迟增加。这种开销在实际生产流量中的经验依据仍未得到充分探索。我们以丹麦国家百科全书为案例研究，评估了来自生产流量和合成条件的20,000个查询-工作流对上的五种检索工作流。在该系统中，合成查询表明超过90%的查询需要LLM增强才能实现高检索覆盖率。然而，在我们的生产延迟策略下，只有27.8%的真实用户查询需要LLM增强。我们将这种差距称为覆盖幻觉，并将其归因于合成查询与真实查询分布之间的结构性不匹配。检索前路由无法解决这一差距，因为LLM增强的需求只有在搜索索引后才能揭示，这一结果得到了我们对四种机器学习范式的评估的证实。这种仅从查询无法检测到的覆盖差距，促使我们采用检索后级联策略，该策略按成本递增顺序运行工作流，仅当某一步骤未返回文档时才升级到LLM增强。该级联策略完全无需训练开销或辅助服务基础设施，在质量上比Always-HyDE提高了+0.140综合总体分数，延迟降低了31.8%，并且72.2%的真实用户查询无需LLM增强即可得到服务。

英文摘要

In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.

URL PDF HTML ☆

赞 0 踩 0

2605.27219 2026-05-27 cs.LG stat.ML

Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis

基于核方法的非线性数据整合用于数据协作分析

Yamato Suetake, Yuta Kawakami, Shunnosuke Ikeda, Yuichi Takano

AI总结针对分散保密数据协作分析中线性整合方法重建风险高且无法对齐非线性变换的问题，提出非线性核整合（NKI）方法，通过核岭回归和特征值问题获得全局最优解，并引入图正则化和中心化约束以捕获几何和目标变量信息，在图像分类任务中提升了准确率并降低了重建风险。

Comments 50 pages, 7 figures

详情

AI中文摘要

分散保密数据集的协作分析很重要，但原始数据集的直接共享常受隐私和机构限制。数据协作（DC）分析通过各方特定的混淆函数将每个数据集转换为隐私保护的中间表示，并使用锚数据集将它们整合为公共协作表示。然而，许多现有的DC分析方法依赖线性变换进行数据混淆和整合，这可能增加重建风险。尽管非线性降维可以缓解这一风险，但传统的线性整合方法无法准确对齐非线性变换产生的中间表示。此外，现有的整合方法主要最小化各方之间的差异，并未明确纳入对下游分析有用的几何或目标变量信息。为克服这些限制，我们首先将线性核整合（LKI）公式化为一种线性整合方法，然后对其进行核化以获得非线性核整合（NKI）。NKI通过核岭回归和特征值问题获得全局最优解。我们还引入了图正则化和中心化约束，使得目标表示能够捕获对下游分析有用的几何和目标变量信息。在图像分类任务上的实验表明，在非线性降维下，NKI比现有的线性整合方法提高了分类准确率，而目标变量感知的图正则化和中心化进一步带来了增益。结果还表明，降维选择显著影响分类准确率和重建风险。

英文摘要

Collaborative analysis of decentralized confidential datasets is important, but direct sharing of original datasets is often restricted by privacy and institutional constraints. Data collaboration (DC) analysis transforms each dataset into privacy-preserving intermediate representations via party-specific obfuscation functions and integrates them into common collaboration representations using an anchor dataset. However, many existing DC analysis methods rely on linear transformations for data obfuscation and integration, which may increase reconstruction risk. Although nonlinear dimensionality reduction can mitigate this risk, conventional linear integration methods cannot accurately align intermediate representations produced by nonlinear transformations. Moreover, existing integration methods mainly minimize discrepancies among parties and do not explicitly incorporate geometric or target-variable information useful for downstream analysis. To overcome these limitations, we first formulate linear kernel integration (LKI) as a linear integration method and then kernelize it to obtain nonlinear kernel integration (NKI). NKI admits a globally optimal solution via kernel ridge regression and an eigenvalue problem. We also introduce graph regularization and a centering constraint so that the target representation can capture geometric and target-variable information useful for downstream analysis. Experiments on image classification tasks demonstrate that NKI improves classification accuracy over existing linear integration methods under nonlinear dimensionality reduction, with further gains from target-variable-aware graph regularization and centering. The results also show that dimensionality reduction choices substantially affect both classification accuracy and reconstruction risk.

URL PDF HTML ☆

赞 0 踩 0

2605.27209 2026-05-27 cs.AI

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

在噪声中学习行动：通过噪声环境增强智能体鲁棒性

Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

AI总结提出NoisyAgent框架，通过在训练中引入用户噪声和工具噪声，提升智能体在真实世界噪声环境下的鲁棒性和泛化能力。

详情

AI中文摘要

大型语言模型（LLMs）的最新进展促进了LLMs作为能够推理、规划和工具使用的交互式智能体的广泛部署。尽管在现有基准测试中表现强劲，但此类智能体在部署到现实世界环境时往往表现出显著退化，因为现实环境本质上是随机且不完美的。我们认为，这种差异源于理想化训练设置与现实交互动态之间的根本性不匹配，当前范式依赖于精心策划的任务指令和稳定、可控的环境。为了解决这一差距，我们提出了NoisyAgent，一个明确将环境不完美性纳入智能体学习过程的智能体训练框架。我们识别出现实场景中交互噪声的两个主要来源：用户噪声，捕捉用户交互中的模糊性和变异性；以及工具噪声，反映工具执行中的失败和异常。我们通过修改用户交互模式和模拟训练环境中的工具执行结果，将此类扰动引入训练流程。为了稳定训练同时鼓励智能体处理日益具有挑战性的不完美性，噪声仅应用于部分轨迹，并随着模型适应当前噪声水平而逐步增加难度。大量实验表明，我们的方法在噪声和动态环境下持续提升智能体鲁棒性。我们的分析揭示，在噪声条件下训练也在理想化基准测试中带来了性能提升，这表明对环境噪声的受控暴露促进了更可泛化的推理和决策行为。我们的发现强调了建模交互不完美性对于弥合智能体训练与现实部署之间差距的重要性。

英文摘要

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.

URL PDF HTML ☆

赞 0 踩 0

2605.27204 2026-05-27 cs.CL cs.IR

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

GraphReview: 基于LLM的图消息传递的科学论文评估

Pujun Zheng, Wanying Ren, Jiacheng Yao, Guoxiu He, Star X. Zhao

AI总结提出GraphReview框架，通过图消息传递整合论文内在质量、同期关联和历时关联，利用LLM生成节点先验和边比较证据，结合个性化PageRank进行质量排序、决策预测和审稿生成，在决策和排序指标上平均提升29.7%。

详情

AI中文摘要

科学论文评估通常不仅涉及评估稿件本身，还需要将其与同期研究和先前文献联系起来。然而，现有的基于LLM的方法通常分别建模这些信号，缺乏跨论文传播审稿证据的统一机制。我们提出$ extbf{GraphReview}$，一个基于图的LLM框架，将论文评估形式化为在语义论文图上进行审稿信号的消息传递。该图联合捕捉内在质量、同期论文之间的同步链接以及指向先前工作的历时链接。LLM用于估计节点级质量先验，并通过成对论文比较生成边级比较证据，而个性化PageRank整合审稿信号用于质量排序、决策预测和审稿生成。为了生成更高质量的图证据，我们提出了奖励诱导的最大似然目标来训练LLM骨干网络。实验表明，GraphReview始终优于最强基线，在决策和排序指标上平均提升29.7%，包括准确率提升23.7%，Spearman's $ρ$提升57.6%。它还生成更高质量的审稿文本，并在不同时间段和会议场所中有效泛化。代码可在https://github.com/ECNU-Text-Computing/GraphReview获取。

英文摘要

Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose $\textbf{GraphReview}$, a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman's $ρ$. It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at https://github.com/ECNU-Text-Computing/GraphReview.

URL PDF HTML ☆

赞 0 踩 0

2605.27203 2026-05-27 cs.CV cs.AI

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

生成式动画：面向提示驱动运动合成的多模型流水线

Mannat Khurana, Sanyam Jain, Rishav Agarwal

AI总结提出一种结合大语言模型和分割模型的流水线，将自然语言提示自动转换为符合场景几何、深度遮挡和3D透视变换的动画运动路径。

Comments 5 pages, 6 figures

2605.27195 2026-05-27 cs.CL

EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization

EpiCurveBench: 评估视觉语言模型在流行病曲线数字化中的表现

Thomas Berkane, Maimuna S. Majumder

AI总结针对现有图表数据提取基准中忽略时间序列结构的问题，提出包含1000张真实流行病曲线图像的EpiCurveBench基准和基于动态规划的EpiCurveSimilarity评估指标，实验表明最强模型仅达52.3% ECS，且ECS能更好区分模型性能。

详情

AI中文摘要

使用视觉语言模型（VLM）进行图表到数据提取的评估，越来越多地依赖于那些显示递减余量的基准（前沿VLM在ChartQA上超过89%）以及将提取点视为无序键值对的指标，忽略了时间序列的时间结构，并将小的对齐偏移视为灾难性失败。我们通过EpiCurveBench（一个从多种公共卫生来源精选的1000张真实流行病曲线图像基准）和EpiCurveSimilarity（ECS，一种通过动态规划对齐预测序列和真实序列的评估指标，容忍局部时间偏移和间隙，同时按比例惩罚它们）来解决这两个空白。评估六种方法——三种前沿闭源VLM、一种开源VLM和两种专门的图表提取系统——我们发现最强的模型仅达到52.3% ECS，并且ECS将四种通用VLM分散在25个百分点的范围内，而键值指标（RMS、SCRM）将它们压缩在5个百分点的范围内。我们进一步针对四个下游流行病学汇总统计量验证ECS，发现更高的ECS预测更小的总计数、峰值时间和峰值幅度误差，以及更高的增长率保真度；在所有四个统计量中，ECS的相关性比动态时间规整强1.5-3.6倍，后者缺乏间隙惩罚，因此无法区分截断预测与时间保真预测。EpiCurveBench针对一个高影响力的公共卫生应用——解锁被困在已发表图表中的数十年的疫情数据——但该基准和指标直接适用于任何结构化时间序列图表提取场景。

英文摘要

Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods--three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems--we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity; across all four, ECS correlates 1.5--3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application--unlocking decades of outbreak data trapped in published figures--but the benchmark and metric apply directly to any structured time-series chart-extraction setting.

URL PDF HTML ☆

赞 0 踩 0

2605.27194 2026-05-27 cs.CL cs.CV cs.LG

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

并非所有标记都同等重要：基于关键标记监督的动态上下文向量蒸馏用于长医学报告生成

Ning Wu, Rui Liu, Xinkun Lin, Weixing Chen, Jinxi Xiang, Tao Wei, Lina Yao, Mingjie Li

AI总结提出DIVE框架，通过关键标记监督和状态条件动态引导，解决长文本生成中标记级蒸馏忽略关键标记的问题，在医学报告生成任务上取得最佳性能。

Comments Preprint. 20 pages, 6 figures

详情

AI中文摘要

将示范效果蒸馏到隐藏空间干预中提供了一种轻量级的替代全微调的方法。然而，现有的多模态变体主要是在短文本任务上评估的，其中输出在几个标记后结束。将这些方法扩展到长文本生成暴露了一个基本但未充分研究的局限性：标记级蒸馏隐式地将所有输出标记视为同等信息量，但长文本输出由高频模板和语法标记主导，而实际决定输出质量的标记稀疏分布。在医学报告生成（MRG）中，有两种这样的关键标记突出：决定诊断内容的病理相关标记和决定终止的序列结束（EOS）事件。两者在均匀交叉熵下都受到不足的监督，自回归解码通过偏离教师强制轨迹进一步加剧了问题。我们提出DIVE，一个冻结骨干的蒸馏框架，通过两种与这些失败相匹配的互补机制来解决长文本报告生成。关键标记监督通过提高病理相关标记和EOS事件的交叉熵贡献来恢复监督平衡，确保内容保真度和终止在训练期间学习，而不是在解码时施加。状态条件动态引导用隐藏状态相关的适配器替换固定的开环残差，允许注入信号随着解码漂移而适应。在MIMIC-CXR和CheXpert Plus上使用两个医学VLM骨干的实验表明，DIVE在词汇和临床代理指标中始终位列最强方法之一。我们的方法在所有数据集-骨干设置中实现了最佳的BLEU-4、ROUGE-L和RadGraph F1，同时在粗粒度标签级CheXbert F1上保持竞争力。

英文摘要

Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset--backbone settings, while remaining competitive on coarse label-level CheXbert F1.

URL PDF HTML ☆

赞 0 踩 0

2605.27190 2026-05-27 cs.CL cs.AI cs.LG cs.SD

Learning When to Think While Listening in Large Audio-Language Models

在大音频语言模型中学习何时在聆听时思考

Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

AI总结提出一种可学习的等待-思考-回答控制机制，通过多奖励强化学习优化大音频语言模型在流式语音交互中的推理时机，在提升准确率的同时减少响应延迟。

Comments 19 pages, 4 figures, 6 tables

详情

AI中文摘要

近期大音频语言模型（LALMs）的进展使得实时、流式的语音交互越来越实用。在这种场景下，推理质量和响应速度紧密耦合：将推理延迟到语音端点可以提高答案质量，但会将思考时间转移到用户可见的响应延迟中，而过早回答则可能在决定性证据到达之前做出承诺。我们为LALMs引入了一种可学习的等待-思考-回答控制公式。受人类对话渐进性启发，控制器在部分音频证据下决定何时等待、何时外化紧凑的推理更新、以及何时回答。以Qwen2.5-Omni-7B为基础模型，我们从语音推理数据中构建对齐的等待-思考-回答轨迹，使用监督微调（SFT）训练控制器，然后应用解耦裁剪和动态采样策略优化（DAPO）。奖励结合了答案正确性、动作有效性、更新时机、延迟同步、推理质量和链一致性，优化完整的等待-思考-回答轨迹，而不仅仅是最终答案。在一个六任务合成语音推理问答（SRQA）基准上，六奖励DAPO控制器将行加权准确率从67.6%提升到70.3%，同时在相同Qwen部署环境下将端点后最终思考长度减少14%。在一个包含186个人类录音的真实音频基准（Real Audio Bench）上，作为超越文本转语音（TTS）渲染语音的迁移检查，控制器家族仍然有效：SFT实现了最强的准确率，而六奖励DAPO控制器是唯一最终思考长度低于基础模型的学习变体。这些结果表明，流式模型应该学习在音频流中何时使中间推理显式化。

英文摘要

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.

URL PDF HTML ☆

赞 0 踩 0

2605.27189 2026-05-27 cs.CL cs.LG cs.SD eess.AS q-bio.NC

Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

超越二元：认知评分层级中的语音表征

Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann, Gerhard Eschweiler, Daniela Berg, Sam Gijsen, Paula Andrea Perez-Toro, Kerstin Ritter

AI总结本研究利用5,754份德语神经心理学评估录音，比较手工声学特征与自监督学习嵌入在轻度认知障碍认知评估层级（任务、领域、全局）中的表现，发现任务约束与评估层级之间的关联。

详情

AI中文摘要

本研究考察了轻度认知障碍中语音表征与认知评估层级结构之间的关系。利用5,754份德语神经心理学评估录音，我们在三个评分层级（任务、领域和全局）上评估了六项认知任务。我们比较了手工声学特征与自监督学习（SSL）嵌入。结果表明，尽管SSL表示在较低层级通常优于手工特征，但这种趋势在MCI分类中发生逆转。此外，任务特定约束影响性能：响应自由度较大的任务随着层级增加表现出性能稀释，表明“专家”表示，而高度结构化任务的性能向更高层级增加，表明“通才”表示。这些发现揭示了自动临床语音分析中任务约束与评估层级之间的联系。

英文摘要

This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting ``specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting ``generalist'' representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.

URL PDF HTML ☆

赞 0 踩 0

2605.27186 2026-05-27 cs.CL

MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

MAIGO: 通过历史清理的在线策略自蒸馏缓解对话丢失

Haoyu Zheng, Yun Zhu, Shu Yuan, Shangming Chen, Qing Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang

AI总结针对大语言模型在多轮对话中性能下降（对话丢失）的问题，提出MAIGO方法，通过在线策略自蒸馏和清理历史助手回复来减少自污染，无需验证器或推理时辅助，显著提升多轮对话准确性。

详情

AI中文摘要

大语言模型通常能从完整指定的提示中解决任务，但当相同需求在多轮中展开时，性能会下降，这被称为对话丢失（LiC）差距。我们将这种退化部分归因于自污染：中间助手的回复进入后续上下文，并将早期偏差向前传递。受此机制启发，我们提出了MAIGO，一种在线策略自蒸馏方法，通过使用模型自身策略的历史清理参考来减少这种污染。对于中间轮次，MAIGO移除先前的助手回复，同时保留用户可见的分片前缀；对于回答轮次，它从基于完整用户侧对话的配对全视图参考中蒸馏。一个可靠性权重降低与干净参考不一致的中间轮次样本的权重。MAIGO不需要验证器奖励、状态标签或推理时辅助。在具有确定性验证器的LiC配对视图协议下，MAIGO将Qwen2.5-7B-Instruct的SHARDED准确率从52.8提升至66.1，SHARDED/FULL比率从66.5%提升至84.1%，同时保持FULL准确率在2.3个点以内。这些结果表明，自污染是LiC差距中一个可训练的成分。

英文摘要

Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.

URL PDF HTML ☆

赞 0 踩 0

2605.27178 2026-05-27 cs.CV cs.AI cs.LG cs.RO

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj: 自监督基础模型作为无标签3D物体分割的奖励

Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang

AI总结提出FoundObj框架，利用自监督2D/3D基础模型的语义和几何先验作为奖励，通过强化学习引导超点合并，实现无标注复杂场景3D物体分割。

Comments ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: https://github.com/vLAR-group/FoundObj

详情

AI中文摘要

我们解决了在训练过程中不依赖任何场景级人类标注的复杂场景点云中3D物体分割的挑战性任务。现有方法通常局限于识别简单物体，这主要是由于学习过程中物体先验不足。在本文中，我们提出了FoundObj，一个新颖的框架，其特点是基于超点的物体发现代理，该代理在我们的创新语义和几何奖励模块的指导下逐步合并合适的相邻超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验，为物体发现代理提供互补反馈，并通过强化学习实现对多类物体的鲁棒识别。在多个基准上的大量实验表明，我们的方法始终优于现有基线。值得注意的是，我们的方法在零样本和长尾场景中表现出强大的泛化能力，突显了其在可扩展、无标签3D物体分割方面的潜力。

英文摘要

We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.

URL PDF HTML ☆

赞 0 踩 0

2605.27174 2026-05-27 cs.SD cs.AI cs.CY

An investigation of AI integration in sound designer workflows and experiences

AI在声音设计师工作流程与体验中的整合研究

Nelly Garcia, Joshua Reiss

AI总结通过混合方法研究（76人调查+20人访谈），发现当前AI工具在快速消费媒体中表现良好，但缺乏高端声音设计所需的叙事复杂性，从业者偏好辅助性、任务特定的应用，而非端到端生成系统。

详情

AI中文摘要

人工智能正越来越多地被整合到专业音频制作工作流程中，然而开发者生产的工具与实际声音设计师的需求之间仍存在差距。本文通过一项混合方法研究调查了这一差距，包括对76名从业者的调查以及对20名行业专业人士的后续半结构化访谈。使用描述性统计分析和主题分析对结果进行分析，以识别两个数据集中的模式。我们的分析得出了五个主题：上下文、工作流程、潜力、风险和正确使用。我们的工作表明，当前的AI工具在快速消费媒体环境中表现良好，但缺乏高端声音设计（电影、沉浸式体验等）所需的叙事复杂性。从业者表现出对辅助性、任务特定应用的偏好，特别是在音频修复和库管理方面，而不是端到端生成系统。这项工作为创意产业中AI及AI增强工具的使用正在进行的讨论做出了贡献。我们从声音设计师和创意音频从业者的角度报告了该领域的当前状况，并根据我们的发现为声音技术专家和开发者提供了一系列建议，以指导开发更明智的AI声音设计工具。

英文摘要

Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences etc). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the on-going discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendation for sound technologist and developers based on our findings to guide the development of more informed AI tools for sound design.

URL PDF HTML ☆

赞 0 踩 0

2605.27168 2026-05-27 cs.CL cs.AI cs.CY

Grounding Text Embeddings in Stakeholder Associations

将文本嵌入与利益相关者关联对齐

Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell

AI总结提出利益相关者对齐练习方法，通过评估嵌入模型与人类专家的语义距离一致性，发现神经文本嵌入在丹麦政策案例中可靠性显著低于专家（差距19-26个百分点），且该差距在美国联邦AI用例中复现（16个百分点）。

详情

AI中文摘要

文本嵌入被广泛用于分析大型复杂文本语料库。然而，尚不清楚这些嵌入是否捕捉到与使用它们的人类专家相同的语义距离。确保嵌入表示与人类意图一致对于有效分析至关重要。我们提出了利益相关者对齐练习，这是一种使专家关联显式化并将嵌入模型结果扎根于人类理解的方法。在我们关于丹麦政策问题的主要案例研究中，我们发现神经文本嵌入的可靠性远低于人类专家（差距19-26个百分点），并且这种不对齐会传播到下游聚类性能（练习排名与聚类质量之间的Spearman $ρ=0.9$）。一项关于美国联邦AI用例的二次研究使用数字协议和不同的专家社区在英语中复现了该差距（16个百分点）——表明该差距并非单一工具或领域的产物。利益相关者对齐练习提供了一种实用方法，用于评估嵌入模型是否捕捉到对领域专家最重要的语义区分。

英文摘要

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $ρ=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.

URL PDF HTML ☆

赞 0 踩 0