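arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.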

2606.03731 2026-06-03 cs.LG stat.ML

Conformal Language Modeling via Posterior Sampling

Nicolas Emmenegger, Theo X. Olausson, Armando Solar-Lezama, Chara Podimata

Affiliations: Massachusetts Institute of Technology

AI summary: Proposes replacing post-hoc filtering with sampling from an approximation to the LLM posterior, conditioned on a calibrated high-scoring region, to achieve target risk control and higher downstream utility.

Abstract

Large Language Models remain plagued by hallucinations. Recent work has sought to tame their prevalence using statistical techniques based on conformal prediction, with both theoretical and empirical success. However, these methods operate in a post-hoc fashion, treating the sampling procedure itself as atomic and then surgically altering samples to remove hallucinated claims. This disconnect between filtering and generation can result in samples that are incoherent, inconsistent, or simply unlikely under the model itself. Moreover, post-hoc surgery is unable to shift probability mass towards more useful and helpful responses. To address these issues, we propose to instead sample from approximations to an LLM posterior, where the conditioning event corresponds to a calibrated, high-scoring region. We develop a calibration procedure tailored to the setting of conditional sequential generation that effectively identifies this region and achieves target risk control. Empirically, we apply our method to case studies focused on open-ended biography generation and mathematical problem solving; compared to prior work, we obtain the same statistical guarantees, with higher downstream utility.
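
The abstract does not spell out the calibration procedure. As a generic illustration only, the sketch below shows the standard split-conformal quantile rule for picking a score threshold from held-out calibration scores. It is not the authors' procedure, and the function and argument names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Under exchangeability, a fresh score exceeds this threshold with
    probability at most alpha. Returns infinity when the calibration set
    is too small for the requested level.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]
```

With nine calibration scores and `alpha = 0.5`, the rule selects the fifth smallest score. With very few scores and a small `alpha`, it falls back to an infinite threshold instead of overclaiming coverage.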

2606.03728 2026-06-03 cs.CL cs.IR

Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Mohamed Hesham Elganayni, Selim Saleh

Affiliations: Technical University of Munich

AI summary: To improve citation quality in retrieval-augmented generation for legal question answering, trains a lightweight cross-encoder on perturbation-based attribution scores to re-rank candidate passages, substantially improving citation faithfulness and alignment with expert answers.

Comments 11 pages, 4 tables, 1 figure. Published at ASAIL 2026 (8th Workshop on Automated Semantic Analysis of Information in Legal Text), co-located with ICAIL 2026, Singapore

Abstract

Retrieval-augmented generation systems for legal question answering typically retrieve passages based on semantic similarity and provide them to a language model, which then generates cited answers. Prior work assumes that highly ranked passages are most likely to be usefully cited by the model. Perturbation-based attribution methods, such as C-LIME, have been used exclusively for post-hoc explanation. However, on the AQuAECHR benchmark, semantic similarity does not correlate with passage attribution. Within a retriever's candidate pool, similarity-based ranking performs worse than random selection at surfacing gold citation paragraphs. To address this limitation, a lightweight cross-encoder is trained on continuous perturbation-based attribution scores to re-rank passages prior to generation. This approach is evaluated on the AQuAECHR benchmark, using two language models and five-fold cross-validation. The re-ranker substantially improves citation faithfulness and alignment with gold expert answers. Notably, two re-rankers trained independently on different models converge beyond their raw attribution agreement. This finding indicates that the cross-encoder reduces model-specific noise and produces a shared relevance signal that partially transfers across models, although same-model re-ranking remains more effective. These results demonstrate that perturbation-based attribution provides a practical, model-agnostic training signal for citation-aware retrieval.
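
To make the idea of a perturbation-based attribution score concrete, here is a minimal sketch using leave-one-out perturbation, a simpler technique than C-LIME. It sorts passages by their attribution directly, whereas the paper trains a cross-encoder to predict such scores. The scoring function `toy_score` is a hypothetical stand-in for a model's answer-quality signal.

```python
def leave_one_out_attribution(passages, score_fn):
    """Attribution of each passage: drop in score when that passage is removed."""
    full_score = score_fn(passages)
    return [
        full_score - score_fn(passages[:i] + passages[i + 1:])
        for i in range(len(passages))
    ]


def rerank(passages, attributions):
    """Order passages from highest to lowest attribution (ties keep input order)."""
    order = sorted(range(len(passages)), key=lambda i: attributions[i], reverse=True)
    return [passages[i] for i in order]


def toy_score(passages):
    # Hypothetical scorer: counts passages that mention the relevant article.
    return sum(1.0 for p in passages if "Article 8" in p)


passages = [
    "weather report",
    "Article 8 protects private life",
    "Article 6 concerns fair trial",
]
attributions = leave_one_out_attribution(passages, toy_score)
reranked = rerank(passages, attributions)
```

Only the second passage changes the toy score when removed, so it moves to the top of the re-ranked list.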

2606.03715 2026-06-03 cs.CV

Text-to-Image Models Need Less from Text Encoders Than You Think

Nurit Spingarn, Noa Cohen, Tamar Rott Shaham, Tomer Michaeli

Affiliations: Technion – Israel Institute of Technology; MIT CSAIL

AI summary: Finds that diffusion-transformer-based text-to-image models rely mainly on the word meanings and word order supplied by the text encoder rather than on full contextual information, and supports this by constructing an embedding that contains only a bag of position-tagged words.

Comments Project webpage: https://nsping13.github.io/contextless-TTI/

Abstract

Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/

2606.03713 2026-06-03 cs.CV

Investigating Adversarial Robustness of Multi-modal Large Language Models

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Affiliations: Mohamed Bin Zayed University of AI, UAE; Khalifa University, UAE; Australian National University, Australia

AI summary: Systematically studies the adversarial robustness of multi-modal large language models, introduces a diagnostic CLIP-alignment protocol that predicts how well robust vision encoders transfer, and shows that end-to-end multimodal adversarial training substantially improves performance under strong adversarial attacks.

Abstract

Multi-modal Large Language Models (MLLMs) achieve strong performance on vision-language tasks, but incorporating visual inputs through a vision encoder (e.g., CLIP) substantially expands the attack surface, making these models vulnerable to visual adversarial perturbations. Prior defenses typically preserve compatibility with pretrained MLLMs by enforcing strict alignment to CLIP's original embedding space during adversarial fine-tuning; while practical, this constraint fundamentally limits achievable robustness. We present a systematic investigation of adversarial robustness in MLLMs. We first introduce a diagnostic CLIP-alignment protocol that predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting, revealing that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. Integrating such encoders into MLLMs via end-to-end multimodal training yields average gains of 28 CIDEr points on captioning and 11.7% VQA accuracy under strong adversarial attacks compared to constrained plug-and-play baselines. We further show that adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, establishing robust visual representations as a strict prerequisite, while end-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3% VQA accuracy. Beyond training-time defenses, lightweight test-time visual stochastic transformations serve as an effective black-box defense for non-robust MLLMs, elevating adversarial performance from near-zero to levels comparable with robust models. Finally, we show that our robust models substantially reduce toxic generation under white-box visual jailbreak attacks. Code and pretrained weights will be released publicly.

2606.03712 2026-06-03 cs.LG

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

Ding Zhang, Runtao Zhou, Wenqing Zheng, Rizal Fathony, Bayan Bruss, Chirag Agarwal

Affiliations: University of Virginia; Capital One

AI summary: Analyzes the internal behavior of graph tokens in graph language models, finds a decoupling between activation-level saliency and graph information utilization, and points to limitations in existing graph-token construction, placement, and alignment mechanisms.

Abstract

Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to graph learning tasks. By transforming graph topology and node information into graph tokens, GLMs allow LLMs to jointly process structured graph inputs and textual instructions. Yet, it remains unclear how LLMs internally interpret these graph tokens and whether graph tokens act as meaningful carriers of graph structure. In this work, we analyze how LLMs process graph information through graph-token behavior in representative GLM architectures. Findings. We find that the internal saliency of graph tokens in GLMs is not equivalent to graph information utilization. Graph sink tokens consistently emerge as activation-level outliers: they can be identified by massive activation values along a small set of hidden-state dimensions and are biased toward early graph-token positions. However, this activation-level saliency does not imply that these tokens are the main carriers of graph information. Unlike classical attention sinks in language and vision-language models, graph sink tokens do not necessarily attract the largest attention weights from query tokens. Through pruning, repositioning, and swapping interventions, we show that graph sink tokens are not the most important semantic or structural tokens for downstream prediction. Implications. Together, these results suggest that after current GLMs map graph structure into the LLM token space, the resulting graph-token representations do not naturally form a fully usable topology-aware internal representation; instead, they exhibit a decoupling between activation-level saliency and graph-semantic utility. This decoupling points to limitations in existing graph-token construction, placement, and alignment mechanisms.

2606.03705 2026-06-03 cs.AI

Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen, Kun Su, Fei Wang, Xiaolong Jin, Jin Zhang, Jiafeng Guo, Xueqi Cheng

Affiliations: Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; Shandong University; Shandong University-Weihai Research Institute of Industrial Technology

AI summary: Proposes the Code-on-Graph (CoG) framework, which represents knowledge graph schemas as Python classes and generates executable code, addressing the inflexible operators and unscalable knowledge injection of existing LLM-KG integration, with gains of up to 10.5% on WebQSP, CWQ, and GrailQA.

Abstract

Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.
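
The schema-as-class idea can be pictured with a small sketch. Everything below is hypothetical: the `Film` and `Director` classes, the toy triples, and the helper functions are illustrative assumptions, not CoG's actual interfaces or prompts.

```python
from dataclasses import dataclass, field


@dataclass
class Film:
    """Hypothetical schema class for a film entity."""
    name: str
    release_year: int


@dataclass
class Director:
    """Hypothetical schema class; retrieved facts become Film objects."""
    name: str
    films: list = field(default_factory=list)


# Made-up retrieved facts in (subject, relation, object) form.
triples = [
    ("Director A", "directed", ("Film Y", 2005)),
    ("Director A", "directed", ("Film X", 1999)),
    ("Director B", "directed", ("Film Z", 1980)),
]


def build_director(name, facts):
    """Instantiate retrieved facts as objects of the schema classes."""
    director = Director(name)
    for subject, relation, (title, year) in facts:
        if subject == name and relation == "directed":
            director.films.append(Film(title, year))
    return director


def earliest_film(director):
    """The kind of short program a model could emit against the classes."""
    return min(director.films, key=lambda f: f.release_year).name


answer = earliest_film(build_director("Director A", triples))
```

The point of the design is that the generated program refers only to class names and attributes, while the facts themselves are loaded at execution time rather than pasted into the prompt.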

2606.03704 2026-06-03 cs.AI cs.CE cs.CY

Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making

Keigo Sakurai, Takahiro Ogawa, Miki Haseyama, Anjyu Anan, Kei Nakagawa

Affiliations: Hokkaido University; Nomura Asset Management Co., Ltd.; Kobe University; Osaka Metropolitan University

AI summary: Proposes DOSS, which formulates objective selection as a classification problem with sequential rolling-window updates and combines confidence-aware gating with LLM oversight to select objectives dynamically in financial decision-making while reducing the risk of misselection and excessive switching.

Comments Accepted to The 2nd Workshop on Advances in Financial AI Workshop: Towards Agentic and Responsible Systems at ICLR 2026

Abstract

Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then select trades or allocations for an investor, and the chosen optimization objective often determines realized performance. However, because market conditions evolve over time, a fixed objective can be suboptimal across regimes, while regime-switching pipelines that rely on latent regime estimates can be noisy or delayed and frequent switching can increase turnover and operational instability. In this paper, we propose DOSS (Dynamic Objective Selection with Safeguards), a learning-based selector that directly chooses the decision-relevant objective function at each time point from interpretable statistical summaries of recent returns, selecting among a small set of candidates (e.g., return-seeking, loss-averse, and risk-adjusted) without introducing intermediate regime variables. DOSS formulates objective selection as a classification problem over objectives and performs sequential updates with a rolling window to make forward-looking selections without temporal leakage, while also outputting a confidence score for each proposal. To mitigate misselection and excessive switching in deployment, DOSS applies confidence-aware gating with a fail-safe that overrides low-confidence proposals to a conservative default and enforces explicit controls tied to switching frequency. We further integrate governance by positioning a Large Language Model (LLM) as an oversight component rather than a generator of new objectives: the LLM is restricted to accept a proposed objective or override it to a predefined safe default, with deterministic rule-based constraints triggering overrides when needed.
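
A minimal sketch of what confidence-aware gating with a switching control could look like. The thresholds, the minimum holding period, and the default objective name are assumptions chosen for illustration. The abstract does not give DOSS's actual rules or parameter values.

```python
def gate_objective(proposed, confidence, current, steps_since_switch,
                   min_confidence=0.6, min_hold=5, safe_default="risk_adjusted"):
    """Decide which objective to use at this time step.

    All parameter values are hypothetical:
    - proposals below min_confidence fall back to the conservative default;
    - a switch away from the current objective is blocked until the current
      objective has been held for at least min_hold steps.
    """
    if confidence < min_confidence:
        return safe_default
    if proposed != current and steps_since_switch < min_hold:
        return current
    return proposed
```

For example, a low-confidence proposal is replaced by the safe default, a confident proposal that arrives too soon after the last switch is deferred, and a confident proposal after the holding period is accepted.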

2606.03698 2026-06-03 cs.LG

Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

Sangeun Park, Minhae Kwon

Affiliations: KAIST

AI summary: Proposes Multi$^2$, a hierarchical multi-agent decision-making framework in which a high-level agent (System 1) generates sub-goals via supervised fine-tuning and a low-level agent (System 2) executes atomic actions via offline-to-online reinforcement learning, mitigating objective drift and enabling stable long-horizon control.

Comments Accepted at ICML 2026

Abstract

A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual reasoning, their long-horizon decision-making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce Multi$^2$, a hierarchical multi-agent decision-making framework that explicitly decomposes agent behavior into complementary roles. A high-level agent (System 1) focuses on context-aware sub-goal generation using supervised fine-tuning (SFT), while a low-level agent (System 2) executes atomic actions through offline-to-online reinforcement learning (RL) in interactive environments. This separation enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, Multi$^2$ consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi-turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long-standing gap in training and evaluating hierarchical decision-making for LLM-based agents.

2606.03695 2026-06-03 cs.CL

Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

Clara Haya Suslik, Or Shafran, Mor Geva

Affiliations: Blavatnik School of Computer Science and AI, Tel Aviv University

AI summary: Proposes the EMBER module, which uses sparse matrix factorization to precisely erase concept-related features from token embeddings, improving the robustness and specificity of existing knowledge erasure methods.

Abstract

As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through adversarial prompting or relearning. In this work, we hypothesize this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-n-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50%, limiting it to 35% on Llama compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.

2606.03694 2026-06-03 cs.RO cs.CV cs.HC

Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

Jessica Wenninger, Gabriel Skantze

Affiliations: Furhat Robotics; University of Naples Federico II; Division of Speech, Music and Hearing, KTH Royal Institute of Technology

AI summary: Addressing frequent identity switches in a social robot's egocentric view, introduces a custom-annotated egocentric dataset, systematically evaluates detection errors, compares face and body tracking, and analyzes the effect of extended spatial memory and appearance re-identification; the optimized pipeline reduces identity switches by 49%.

Comments 8 pages, 5 figures, 3 tables. Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
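
As a rough illustration of the IDSW metric, the sketch below counts a switch whenever a ground-truth person is matched to a different track ID than at their previous matched frame. This follows the common multi-object-tracking convention and assumes the ground-truth-to-track matching has already been done. The input format and names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def count_identity_switches(frames):
    """Count identity switches over a sequence of frames.

    frames: list of dicts mapping ground-truth person ID -> matched track ID.
    A person missing from a frame (occluded or out of view) is skipped, and
    their last matched track ID is remembered for the next appearance.
    """
    last_track = {}
    switches = 0
    for assignments in frames:
        for gt_id, track_id in assignments.items():
            if gt_id in last_track and last_track[gt_id] != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


frames = [
    {"alice": 1, "bob": 2},
    {"alice": 1},            # bob is occluded in this frame
    {"alice": 3, "bob": 2},  # alice's track ID changes: one switch
    {"alice": 3, "bob": 1},  # bob's track ID changes: one switch
]
total_switches = count_identity_switches(frames)
```

In this toy sequence the occlusion itself does not count as a switch. Only the two track ID changes do.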

2606.03693 2026-06-03 cs.CL cs.CV

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

Affiliations: Intelligent System Laboratory, Faculty of Computer Science Brawijaya University

AI summary: Builds IndoRad-VQA, an Indonesian radiology VQA dataset, to evaluate the robustness of medical vision-language models under non-English clinical language, and finds a performance gap of 8 to 25% between the English and Indonesian settings, indicating the need for more inclusive multilingual evaluation.

Comments accepted to MMFM-BIOMED Workshop @ CVPR 2026

Abstract

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.

2606.03692 2026-06-03 cs.AI cs.CL

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu

Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Beijing Academy of Artificial Intelligence, Beijing, China

AI summary: To address agents' lack of systematic skill construction, accumulation, and transfer, proposes SkillPyramid, a hierarchical skill consolidation framework with a self-evolution mechanism that composes, validates, and incorporates new skills during task execution, raising average reward by 38.0% and cutting execution steps by 27.7% on three benchmarks.

Abstract

Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.

2606.03689 2026-06-03 cs.LG cs.AI

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

Mariana Vargas Vieyra

Affiliations: GitHub

AI summary: Proposes a training-free survival regression method that uses tabular foundation models to predict event times and iteratively impute right-censored data, building an accelerated failure time model that performs comparably to trained models on standard benchmarks.

Abstract

Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. It is widely used in several domains, including healthcare and churn prediction, and a central challenge in its applicability stems from the time of the event being only partially observed, known as right-censoring. Tabular Foundation Models (TFM) have attracted significant interest in recent years due to their ability to perform prediction tasks in a single forward pass, requiring no dataset-specific parameter fitting. Despite their success, their application to prediction tasks on time-to-event data remains difficult due to right censoring. In this work, we present a training-free method for survival regression by leveraging TFMs to both predict the time of the event and iteratively impute right-censored data. Our method uses a TFM to construct an Accelerated Failure Time (AFT) model requiring no training beyond fitting a single scalar parameter. Subsequently, by building on the Buckley-James estimator, we introduce a non-parametric in-context estimator for right-censored data. Our experiments on standard survival analysis benchmarks show that our method is competitive with several parametric and semi-parametric survival regression models that require training, including Cox regression and parametric AFT models.

2606.03685 2026-06-03 cs.LG cs.AI

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

监督微调的大语言模型规划器中世界模型恢复的深入探究

Patrick Emami, Nan Qiang, Peter Graf

发表机构 * National Laboratory of the Rockies（落基山国家实验室）

AI总结通过可解释性实验，研究监督微调如何影响大语言模型在经典规划任务中恢复世界模型的能力，发现微调使模型线性编码动作有效性和状态谓词，且更广泛的状态空间覆盖有助于更准确的世界模型恢复。

Comments 17 pages. Under review at TMLR

详情

AI中文摘要

监督微调（SFT）改进了大语言模型（LLM）中的端到端经典规划，但这些模型是否也学会了表示和推理它们正在解决的规划问题？由于经典规划问题的相对复杂性以及端到端规划生成对LLM的挑战，探索这个问题一直很困难。在我们的工作中，我们设计并执行了一系列可解释性实验，通过检查微调LLM的内部表示和生成能力，全面探究世界模型恢复。我们发现：a) 对有效动作序列进行监督微调使LLM能够线性编码动作有效性和一些状态谓词。b) 难以使用输出概率对动作有效性进行分类的模型可能仍然学习到将有效动作与无效动作分开的内部表示。c) 微调期间更广泛的状态空间覆盖（例如来自随机游走数据）能更准确地恢复底层世界模型。总之，这项工作为将可解释性技术应用于规划LLM提供了一种方法，并产生了有助于揭示LLM中知识表示方式的见解。

English abstract

Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
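Finding (a) rests on linear probing: fitting a linear classifier on frozen hidden activations to test whether a property is linearly decodable. The sketch below shows the idea with logistic regression on synthetic vectors. The activations, labels, and dimensions are made up for illustration and do not come from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_linear_probe(activations, labels, lr=0.1, epochs=200):
    """Fit logistic regression by full-batch gradient descent."""
    dim = len(activations[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(activations)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, activations, labels):
    hits = 0
    for x, y in zip(activations, labels):
        pred = 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0
        hits += pred == y
    return hits / len(labels)


def synthetic_activations(n, dim=4, seed=0):
    # Stand-in for hidden states: label 1 ("valid action") clusters around +2,
    # label 0 ("invalid action") around -2 in every coordinate.
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        center = 2.0 if label else -2.0
        xs.append([rng.gauss(center, 1.0) for _ in range(dim)])
        ys.append(label)
    return xs, ys


train_x, train_y = synthetic_activations(200, seed=0)
test_x, test_y = synthetic_activations(100, seed=1)
w, b = train_linear_probe(train_x, train_y)
print(probe_accuracy(w, b, test_x, test_y))
```

High held-out accuracy from such a probe is evidence that the property is linearly encoded; it does not by itself show that the model uses that encoding when generating plans.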

2606.03681 2026-06-03 cs.LG

Speedrunning Tabular Foundation Model Pretraining

表格基础模型预训练的速通

Salih Bora Ozturk, Alexander Pfefferle, Frank Hutter

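Affiliations: University of California, Berkeley

AI summary: Proposes a speedrun competition format that optimizes a single-file training script, achieving an 81x pretraining speedup on nanoTabPFN, and establishes a community leaderboard to accumulate improvements.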

2606.03675 2026-06-03 cs.CV

A Fast Methane Detection Pipeline on Board Satellites Based on Mag1c-SAS and LinkNet

基于Mag1c-SAS和LinkNet的星载甲烷快速检测流水线

Jonáš Herec, Vít Růžička, Rado Pitoňák, Jan Sedmidubsky

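Affiliations: Zaitra s.r.o.; NASA JPL; Faculty of Informatics, Masaryk University

AI summary: Proposes the Mag1c-SAS algorithm to accelerate methane detection and pairs it with a lightweight LinkNet model for denoising, enabling efficient, low-power methane leak detection on onboard satellite hardware.

Comments: arXiv admin note: substantial text overlap with arXiv:2507.01472

AI Chinese abstract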

甲烷是一种强效温室气体，通过高光谱卫星图像早期检测泄漏有助于减缓气候变化。然而，许多现有高光谱任务仅捕获操作员手动瞄准的区域，从而遗漏潜在感兴趣事件。为了经济高效地克服下行链路速率慢的问题，星载检测是一种可行的解决方案。然而，传统的甲烷检测方法对于资源受限的星载硬件计算需求过高。本工作通过关注高效、低功耗算法来加速甲烷检测。具体而言，我们测试了先前未用于甲烷检测的快速目标检测ACE和CEM方法，并提出了Mag1c-SAS，即当前最先进Mag1c算法的显著更快变体。为了探索其检测潜力，我们将它们与基于U-Net和LinkNet的机器学习模型集成。我们在STARCOP数据集和一个新的EMIT-MSeg数据集上评估我们的方法，该数据集我们与高质量注释策略一起引入并开源。所提出的Mag1c-SAS方法被证明非常有效，运行速度比原始Mag1c方法快约80倍，提供视觉上相似但噪声更大的结果。当额外与轻量级LinkNet方法配对时，它有效降低了噪声，在EMIT-MSeg上相比基线Mag1c方法AUPRC得分提高了超过30个百分点，在STARCOP上F1得分提高了约4个百分点。我们评估了两种新颖的波段选择策略，并通过硬件分析确认了系统的星载可行性，展示了边际功耗和高效的CPU/RAM利用率。我们以用户友好的轻量级PyPI库形式发布最终系统，网址为：https://pypi.org/project/onboard-methane-detection/，同时所有实验代码、模型和数据发布在：https://github.com/zaitra/methane-filters-benchmark。

English abstract

Methane is a potent greenhouse gas, and detecting leaks early via hyperspectral satellite imagery can help climate change mitigation efforts. Meanwhile, many existing hyperspectral missions only capture areas manually targeted by operators, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane detection methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. In particular, we test fast target detection ACE and CEM methods that have not been previously used for methane detection and propose Mag1c-SAS -- a significantly faster variant of the current state-of-the-art Mag1c algorithm. To explore their detection potential, we integrate them with a machine learning model based on U-Net and LinkNet. We evaluate our methods on the STARCOP dataset and a novel EMIT-MSeg dataset, which we introduce and open-source alongside a high-quality annotation strategy. The proposed Mag1c-SAS approach proves highly effective by operating ~80x faster than the original Mag1c approach, providing a visually similar, but noisier result. When additionally paired with the lightweight LinkNet approach, it effectively reduces noise, achieving AUPRC score improvements of over 30 pp on EMIT-MSeg compared to the baseline Mag1c approach, and an F1 score on STARCOP ~4 pp higher. We evaluate two novel band selection strategies and confirm the system's onboard viability through hardware profiling, demonstrating marginal power consumption and efficient CPU/RAM utilization. We release the final system in a user-friendly and lightweight PyPI library at: https://pypi.org/project/onboard-methane-detection/, alongside all experimental code, models, and data at: https://github.com/zaitra/methane-filters-benchmark.

2606.03672 2026-06-03 cs.SD cs.MM

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

Foley-Omni：从任务级音频合成到完整视频配乐生成的统一多模态生成模型

Ye Tao, Lupeng Liu, Xuenan Xu, Jiasun Feng, Jiarui Wang, Ying Qin, Shuiyang Mao, Wei Liu, Shuai Wang

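Affiliations: School of Intelligence Science and Technology, Nanjing University; Video Rebirth; Shanghai Jiao Tong University; Beijing Jiaotong University; Shanghai AI Laboratory

AI summary: Proposes Foley-Omni, a unified multimodal audio generation model that jointly models speech, sound effects, and music within a shared latent generation process, extending isolated task-level synthesis to complete video soundtrack generation, and builds the V2ST-Bench benchmark for holistic evaluation.

AI Chinese abstract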

最近的统一音频生成模型可以支持语音、音效和音乐等多种任务，但大多数仍然专注于孤立的任务级合成。然而，真实的视频制作通常需要为同一视频联合且一致地生成完整音轨的多个组成部分。我们提出了Foley-Omni，一种统一的多模态音频生成模型，通过在一个共享的潜变量生成过程中联合建模语音、音效和音乐，将孤立的任务级合成扩展到完整的视频配乐生成。为了支持训练和可重复评估，我们开发了一个视听数据整理流程，并引入了V2ST-Bench，一个用于整体视频配乐生成评估的基准。实验表明，Foley-Omni在单个合成任务上与专家系统相比具有竞争性的性能，同时在混合配乐生成中提高了语音清晰度、视听一致性和感知质量。

English abstract

Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.

2606.03666 2026-06-03 cs.CV

Beyond Single Solution: Multi-Hypothesis Collaborative Deep Unfolding Network for Image Compressive Sensing

超越单一解：用于图像压缩感知的多假设协作深度展开网络

Wenxue Cui, Hualin Li, Yuhang Qin, Yifu Xu, Xiaopeng Fan, Debin Zhao

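Affiliations: Harbin Institute of Technology, Harbin, China; Harbin Institute of Technology Suzhou Research Institute, Suzhou, China

AI summary: To address the ill-posedness of compressive sensing, proposes a multi-hypothesis collaborative deep unfolding network (MHC-DUN) that jointly optimizes across multiple solution spaces, uses AlphaNet to dynamically predict spatially varying step sizes for gradient descent, and designs a multi-hypothesis collaborative proximal mapping module to improve reconstruction quality.

Comments: Accepted by CVPR 2026

AI Chinese abstract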

最近的深度展开网络（DUNs）通过将迭代优化与深度学习架构有效集成，推动了压缩感知（CS）的发展。然而，大多数CS方法主要将其推理限制在单一解空间，忽略了CS问题固有的病态性，该病态性本质上允许多个合理的候选假设。本文提出了一种新颖的多假设协作深度展开CS网络（MHC-DUN），该网络通过跨不同解空间联合优化，显式建模并利用多个假设。具体而言，遵循近端梯度下降算法，MHC-DUN在此多假设范式下联合执行梯度下降和近端映射。i) 对于梯度下降，引入精心设计的AlphaNet，动态预测所有假设的空间变步长，实现跨多个解的协作梯度更新。ii) 对于近端算子，设计了一个复杂的多假设协作近端映射模块，该模块利用假设内和假设间的相关性先验，联合优化多个解。为了实现端到端训练，设计了一种新颖的复合损失函数，该函数平衡测量保真度、假设多样性和重建精度，在保持重建保真度的同时鼓励探索互补解。实验结果表明，所提出的CS方法优于现有的CS网络。

English abstract

Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches predominantly confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems that intrinsically permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.
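MHC-DUN unfolds the Proximal Gradient Descent algorithm, so a plain version of that iteration may help fix ideas. The sketch below is classical single-hypothesis ISTA for the l1-regularized least-squares problem min_x 0.5*||Ax - y||^2 + lam*||x||_1, with a fixed scalar step size and a soft-thresholding proximal operator. These are textbook choices assumed here for illustration; according to the abstract, the paper replaces them with AlphaNet-predicted step sizes and a learned multi-hypothesis proximal module.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def ista(A, y, lam, step, iterations):
    rows, cols = len(A), len(A[0])
    x = [0.0] * cols
    for _ in range(iterations):
        # Gradient step on the data term: grad = A^T (A x - y).
        residual = [sum(A[i][j] * x[j] for j in range(cols)) - y[i]
                    for i in range(rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(rows))
                for j in range(cols)]
        z = [x[j] - step * grad[j] for j in range(cols)]
        # Proximal step on the l1 term.
        x = [soft_threshold(v, step * lam) for v in z]
    return x


# Toy problem with an identity sensing matrix.
A = [[1.0, 0.0], [0.0, 1.0]]
result = ista(A, [3.0, 0.5], lam=1.0, step=1.0, iterations=10)
print(result)
```

For convergence the fixed step should not exceed the reciprocal of the largest eigenvalue of A^T A; a deep unfolding network instead treats each iteration as a layer and learns such quantities from data.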

2606.03657 2026-06-03 cs.AI

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

诊断大语言模型工具使用中的知识缺口：面向新API获取的智能体基准

Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen

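Affiliations: NYU Shanghai

AI summary: Proposes the NovelAPIBench benchmark, which dynamically discovers novel APIs, extracts decomposed knowledge bundles, and generates executable tasks, diagnosing six categories of model errors in API use and finding that retrieval and parametric tuning are complementary.

Comments: 37 pages, 12 figures

AI Chinese abstract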

用于代码生成的大语言模型通常需要使用预训练数据中不存在的API。这不仅仅是回忆函数名：模型必须协调签名、模块路径、输入输出契约、语义和可执行使用模式。现有的新API基准通常是静态的，依赖于粗略的通过/失败指标，或使用可能无法反映真实库演变的合成API。我们引入了NovelAPIBench，一个全自动动态基准，对于任何基础模型和目标库，发现新API，提取分解的知识包，生成可执行编码任务，并将失败样本分配到六个诊断类别。在大约1.9K个任务、四个基础模型和五个领域上，我们比较了通过检索注入的知识与通过参数自适应内化的知识。我们发现知识组件不可互换：使用示例是最强的独立信号，而最佳的双组件设置将签名与机制或示例配对，具体取决于领域和骨干。添加更多上下文，尤其是源代码，可能通过增加导入路径错误而有害。一旦外部知识被移除，参数自适应也不能取代检索；相反，微调主要教会模型如何使用提供的包，并且这种能力可以迁移到保留的库。这些结果表明检索和调优扮演互补角色：检索提供易变的API内容，而调优改进程序性整合。

English abstract

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.

2606.03655 2026-06-03 cs.AI cs.LO

Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

命题可废止立场逻辑中的非单调蕴涵

Nicholas Leisegang, Thomas Meyer, Ivan Varzniczak

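Affiliations: University of Cape Town and CAIR, South Africa; Université Sorbonne Paris Nord, Inserm, Sorbonne Université, Limics, 93017 Bobigny, France; ISTI-CNR, Pisa, Italy

AI summary: Introduces situated standpoint conditionals to lift KLM-style non-monotonic rational entailment relations to a fragment of propositional defeasible standpoint logic (PDSL), shows that this fragment can be expressed as a set of situated conditionals, and faithfully translates ranking-based entailment relations (such as rational and lexicographic closure) from the propositional case to PDSL while preserving complexity bounds.

AI Chinese abstract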

近期在可废止推理领域的研究中，Kraus等人提出的优先语义和蕴涵概念已被应用于模态逻辑。然而，该领域的工作主要集中在可满足性检查以及单调蕴涵关系上，后者在推理上可能较弱。引入这一概念的一个特定模态逻辑是命题立场逻辑，其中的模态可以表达不同视角的观点。这导致了命题可废止立场逻辑（PDSL）的形式化。在本文中，我们提出了一种方法，将（非单调）理性蕴涵关系类从传统的KLM风格推理提升到PDSL的一个片段中。为此，我们通过情境立场条件句扩展了PDSL的表达力，使得我们能够在给定立场的上下文中讨论可废止条件句。这使我们能够用情境条件句重新刻画PDSL的语法，并表明PDSL的一个大片段可以表达为一组情境条件句。然后，我们专注于刻画该片段中的非单调蕴涵，定义了一种方法，将任何基于排序的蕴涵关系从命题情况移植到PDSL情况。这首先在一般情形下描述，然后在理性和词典序闭包的具体情形下考虑，为每个推理提供了到PDSL的忠实翻译。我们还表明，该PDSL片段中的蕴涵检查可以主要使用命题情况下的算法进行，同时保持复杂度界限。

English abstract

Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal logics. However, work in this field has focussed primarily on satisfiability checking, and monotonic notions of entailment, which may be inferentially weak. One particular modal logic where this has been introduced is propositional standpoint logics, where modalities can express the views of different viewpoints. This has resulted in the formalisation of propositional defeasible standpoint logic (PDSL). In this paper, we propose a means of lifting the class of (non-monotonic) rational entailment relations from traditional KLM-style reasoning to a fragment of PDSL. In order to do so, we extend the expressivity of PDSL via situated standpoint conditionals, allowing us to talk about a defeasible conditional holding in the context of a given standpoint. This allows us to re-characterise the syntax of PDSL in terms of situated conditionals, and shows that a large fragment of PDSL is expressible as a set of situated conditionals. We then focus on characterising non-monotonic entailment in this fragment, defining a method to transport any ranking-based entailment relation from the propositional case into the PDSL case. This is first described in the general case and then considered in the specific cases of rational and lexicographic closures, providing a faithful translation of each inference into PDSL. We also show that entailment-checking in this fragment of PDSL can be done largely using algorithms from the propositional case, while preserving complexity bounds.

2606.03648 2026-06-03 cs.CL cs.AI

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

微调大语言模型的安全性测量应基于能力

Krishnapriya Vishnubhotla, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko

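Affiliations: National Research Council, Canada

AI summary: Anchors fine-tuning to specific capability goals and evaluates its effects on model capability and safety along multiple dimensions, finding that fine-tuned models can produce incoherent outputs for safety prompts, that automated safety judgments are unreliable for such outputs, and that conclusions vary with the safety benchmark and evaluator.

Comments: 8 pages plus appendices

AI Chinese abstract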

通过微调将基础大语言模型适应用户的任务或偏好风格可能会损害模型的安全性。先前的研究在有限且看似随机的实验设置中考察了微调对模型安全性的影响。我们认为，将微调锚定于特定的能力目标对于避免任意的经验选择至关重要，这使我们能够得出关于安全性影响的有意义结论，并在一致的基础上比较缓解方法。我们通过关注能力和安全性，对微调对模型行为的影响进行了多维度评估。我们的结果揭示了重要问题：(1) 微调模型可能对安全提示产生不连贯的生成内容，(2) 对于这种不连贯输出，自动安全判断不可靠，(3) 关于微调影响的结论可能因安全基准以及安全评估者的选择而改变。

English abstract

Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts, and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface important issues that (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.

2606.03646 2026-06-03 cs.CV

A Benchmark for Semi-supervised Multi-modal Crowd Counting

半监督多模态人群计数基准

Haoliang Meng, Xiaopeng Hong, Yabin Wang, Wangmeng Zuo

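Affiliations: Harbin Institute of Technology; Pengcheng Laboratory

AI summary: Builds the first benchmark for semi-supervised multi-modal crowd counting, laying a foundation for the task by defining standardized protocols and evaluating a range of baseline methods.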

2606.03645 2026-06-03 cs.LG cs.AI

The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models

加法的形状：大型语言模型中算术的几何结构

Liuyuan Wen, Xun Zhu, Lihao Huang, Wenbin Li, Yang Gao

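Affiliations: University of Science and Technology of China

AI summary: Analyzes the geometry of the residual stream during multi-operand addition, identifies the Iso-Raw-Sum Trajectory (IRST), and builds a Noisy Quantization Model that explains arithmetic errors as geometric slippages caused by internal neural noise, then uses a geometric consistency check to detect and correct quantization failures.

Comments: Accepted by ICML 2026

AI Chinese abstract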

大型语言模型在基本算术中表现出矛盾的脆弱性，暗示内部计算与离散输出之间存在脱节。通过分析多操作数加法中的残差流几何结构，我们识别出等原始和轨迹（IRST），这是一种由语义数字锚定并由连续进位纤维调制的几何结构。我们提出噪声量化模型来解释这种几何结构，将算术错误视为由内部神经噪声推动连续的潜在进位势跨越量化阈值引起的几何滑移。这一几何框架进一步阐明了探针多功能性，解释了轻量级探针如何从单个激活向量中解开共存的潜在信号（如真实值与幻觉）。最后，我们通过一种几何一致性检查方法验证了这些见解，该方法在推理过程中有效检测和纠正了这些量化失败。我们的代码可在以下网址获取：https://github.com/RL-MIND/Shape-of-Addition。

English abstract

Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discrete output. By analyzing the residual stream geometry during multi-operand addition, we identify the Iso-Raw-Sum Trajectory (IRST), a geometric structure where representations are anchored by semantic digits and modulated by continuous carry fibers. We propose the Noisy Quantization Model to explain this geometry, framing arithmetic errors as Geometric Slippages caused by internal neural noise pushing a continuous, latent Carry Potential across quantization thresholds. This geometric framework further elucidates Probe Versatility, explaining how lightweight probes can disentangle coexisting latent signals (such as ground truth versus hallucination) from a single activation vector. Finally, we validate these insights through a geometric consistency check method that effectively detects and corrects these quantization failures during inference. Our code is available at https://github.com/RL-MIND/Shape-of-Addition.

2606.03644 2026-06-03 cs.LG

Spatial Transcriptomics-Guided Alignment Enhances Molecular Profiling in Pathology Foundation Model

空间转录组学引导的对齐增强病理基础模型中的分子分析

Fengtao Zhou, Yingxue Xu, Zhengyu Zhang, Yihui Wang, Zhengrui Guo, Ling Liang, Jiabo Ma, Cheng Jin, Ziyi Liu, Huajun Zhou, Hongyi Wang, Du Cai, Chenglong Zhao, Xi Wang, Can Yang, Yu Wang, Wenbin Li, Feng Gao, Zhe Wang, Zhenhui Li, Xiuming Zhang, Li Liang, Hao Chen

Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Pathology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China; Guangdong Province Key Laboratory of Molecular Tumor Pathology, Guangzhou, China; Jinfeng Laboratory, Chongqing, China

AI summary: Proposes the STAMP framework, which uses spatial transcriptomics data and a pathway-informed alignment strategy to strengthen the molecular awareness of pathology foundation models, and validates its clinical utility through a multi-tier evaluation.

AI Chinese abstract

全面的分子分析对于现代精准肿瘤学至关重要，但高昂的成本、标本耗尽和漫长的周转时间仍然阻碍其应用。虽然病理基础模型（PFMs）已显示出从常规苏木精-伊红（H&E）全切片图像推断分子表型的潜力，但当前架构主要依赖于以视觉为中心的自监督学习或视觉-语言对齐，缺乏将细微形态学特征与潜在基因组改变联系起来所需的空间解析分子监督。空间转录组学（ST）作为一种变革性技术出现，能够在完整组织切片内进行转录组定量，从而保留组织学与分子谱之间的精确空间联系。在本研究中，我们提出了用于分子分析的空间转录组学引导对齐框架（STAMP），该框架赋予PFMs内在的分子感知能力。为支持这一范式，我们整理了HumanST-1k，一个涵盖不同解剖器官和测序平台的人类ST数据集。该图谱产生了180万对H&E斑块及其对应的转录组谱，提供了一个将组织学结构与其分子状态联系起来的语料库。为减轻原始转录组学中固有的技术噪声，STAMP采用了一种通路感知对齐策略，将转录组数据聚合为生物学功能通路，随后通过参数高效微调将其整合到PFMs中。这种对齐丰富了PFMs的表征空间，并释放了其解析亚视觉分子特征的能力。通过多层级评估框架验证了这些增强表征的临床实用性。

English abstract

Comprehensive molecular profiling is essential for modern precision oncology but remains hindered by prohibitive costs, specimen exhaustion, and protracted turnaround times. While pathology foundation models (PFMs) have demonstrated potential for inferring molecular phenotypes from routine hematoxylin and eosin (H&E) whole-slide images (WSIs), current architectures primarily rely on vision-centric self-supervised learning or vision-language alignment, lacking the spatially resolved molecular supervision required to connect subtle morphological features with underlying genomic alterations. Spatial transcriptomics (ST) emerges as a transformative technology that enables transcriptomic quantification within intact tissue sections, thereby preserving the precise spatial link between histology and molecular profiles. In this study, we present a Spatial Transcriptomics-guided Alignment framework for Molecular Profiling (STAMP), which endows PFMs with intrinsic molecular awareness. To support this paradigm, we curated HumanST-1k, a human ST dataset spanning diverse anatomical organs and sequencing platforms. This atlas yields 1.8 million pairs of H&E patches and corresponding transcriptomic profiles, providing a corpus that links histological structures with their molecular states. To mitigate the technical noise inherent to raw transcriptomics, STAMP applies a pathway-informed alignment strategy that aggregates transcriptomic data into biologically functional pathways, which are subsequently integrated into PFMs via parameter-efficient fine-tuning. This alignment enriches the representation space of PFMs and unlocks their capacity to resolve sub-visual molecular signatures. The clinical utility of these augmented representations was validated through a multi-tier evaluation framework.

2606.03641 2026-06-03 cs.AI cs.CY

Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency

LLM医疗分诊中的性别依赖性诊断替代：相同症状，不同紧急程度

Qi Han Wong

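Affiliations: GitHub

AI summary: Studies how large language models give different triage recommendations for identical neurological symptoms when only patient gender and age differ, finding that young women's urgency is systematically underestimated and that the mechanism is diagnostic substitution.

Comments: 7 pages, 3 tables. Multi-model replication across Gemini, Claude, and GPT. Code and data: https://github.com/wongqihan/ai-behavioral-experiments

AI Chinese abstract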

我们调查了大型语言模型是否会对相同的神经症状，在仅改变患者性别和年龄的情况下，产生不同的医疗分诊建议。使用三个模型家族（Gemini 3.5 Flash、Claude Sonnet 4.6和GPT-5.4-mini），我们呈现了一个标准化的症状特征（持续性头痛、视力模糊、晨起恶心、视觉障碍），跨越七个人口统计学条件：三个年龄组（25、38、65岁）×两个性别（男、女），加上一个性别未指定的基线（每个模型每个条件n=30，共630次试验）。我们发现了一个显著、系统性的性别依赖性分诊差异：年轻女性获得的急诊室转诊率显著低于同龄男性（Gemini：0% vs. 23.3%；Claude：6.7% vs. 96.7%；GPT：6.7% vs. 66.7%，所有p<0.001）。所有模型在65岁年龄组中差异消失。主要机制是诊断替代：模型锚定于与性别相关的诊断，优先将年轻女性分类为特发性颅内高压（一种流行病学上与育龄女性相关的疾病），而将男性诊断为一般性颅内压增高，并把占位性病变列入鉴别诊断。这种诊断闭合将女性患者导向较低紧急程度的护理（门诊医生预约），尽管严重程度评分相当（7-9/10）。我们的发现表明，临床LLM通过使用流行病学先验来抑制分诊紧急程度，复制了已知的人类临床偏见，提示AI分诊引擎必须将紧急程度评估与概率性诊断先验解耦。我们发布了所有代码、提示和原始结果。

English abstract

We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
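As a quick consistency check on the design, the trial count and the raw referral counts implied by the reported percentages can be worked out directly. The sketch assumes each rate is a fraction of the 30 trials per condition per model, as stated in the abstract; the counts are derived here and are not reported by the authors.

```python
AGE_GROUPS = 3
GENDERS = 2
BASELINES = 1          # gender-unspecified condition
TRIALS_PER_CELL = 30   # n per condition per model
MODELS = 3

conditions = AGE_GROUPS * GENDERS + BASELINES
total_trials = conditions * TRIALS_PER_CELL * MODELS

# Reported ER referral rates in percent: (young women, age-matched men).
reported = {"Gemini": (0.0, 23.3), "Claude": (6.7, 96.7), "GPT": (6.7, 66.7)}

# Convert each percentage back to a whole number of referrals out of 30.
implied_counts = {
    model: tuple(round(rate / 100 * TRIALS_PER_CELL) for rate in rates)
    for model, rates in reported.items()
}

print(conditions, total_trials)
print(implied_counts)
```

The design gives seven conditions and 630 trials in total, matching the abstract, and every reported percentage corresponds to a whole number of referrals out of 30.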

2606.03635 2026-06-03 cs.CV cs.AI

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

VidMsg：短视频中隐含信息推断的基准测试

Issar Tzachor, Michael Green, Rami Ben-Ari

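Affiliations: OriginAI, Israel

AI summary: Proposes the VidMsg benchmark, which uses a message-first construction pipeline and bidirectional retrieval tasks to evaluate how well video understanding models infer implicit messages in short videos.

Comments: Project page: https://iyttor.github.io/VidMsg

AI Chinese abstract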

理解短视频不仅仅是识别可见物体和动作；视频制作者常常在片段中包含潜在的信息或目的。我们引入了VidMsg，一个用于评估互联网原生短视频中隐含信息理解的基准测试。VidMsg包含400个来自YouTube的片段，涵盖9个实际主题领域和52个细粒度目标信息，涉及职业与金融、教育、健康与福祉、文化、安全、可持续性和生活方式等领域。VidMsg通过消息优先流程构建：LLM首先将目标信息转化为间接搜索场景，用于检索候选片段。然后，人工标注者保留那些传达预期信息但不过于直白的片段。VidMsg主要设计用于双向消息-片段检索，适用于视频搜索和推荐等可扩展应用，系统必须捕捉全面的视频理解。除了检索，VidMsg还包括一个诊断性多项选择问答基准，模型需要从语义相关的选项中选出片段的预期信息。与当代视频语言和检索模型的实验表明，强模型在VidMsg上常常失败，因为该任务需要语用推理、上下文线索整合以及语义相近信息的区分。我们还引入了VidVec-Msg，一种改进消息导向检索的基线方法，同时为未来工作留下了足够的提升空间。

English abstract

Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.

2606.03629 2026-06-03 cs.AI

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

TSQAgent: 通过专用智能体推理评估时间序列数据质量

Shunyu Wu, Dan Li, Haozheng Ye, Weibin Feng, Jian Lou, Bo Zhang, Wenjie Feng, Chenjuan Guo, See-Kiong Ng

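Affiliations: Sun Yat-sen University; China University of Mining and Technology; University of Science and Technology of China; East China Normal University; National University of Singapore

AI summary: Proposes the TSQAgent framework, in which three collaborating agents (Perceiver, Inspector, Adjudicator) identify relevant quality dimensions and perform quantitative comparisons, substantially improving LLM performance on time series data quality assessment.

AI Chinese abstract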

评估时间序列（TS）数据的质量是基础但极具挑战性的任务，因为质量维度具有多面性。最近，大语言模型（LLM）通过成对比较和逐维度评估，成为TS质量评估的一种有前景的范式。然而，现有方法依赖手动预定义的质量维度和纯文本推理，尚不清楚LLM能否识别真正相关的质量维度或进行基于证据的定量质量比较。为探究此问题，我们构建了TSQBench，一个专用基准，用于评估LLM在两种渐进能力上的表现：（i）理解和识别相关质量维度，（ii）在特定维度下进行质量比较。分析表明，当前LLM在维度识别和基于证据的质量比较方面均存在困难。为解决这些局限，我们提出TSQAgent，一种新颖的用于TS质量评级的智能体推理框架，包含三个协作角色：感知器（负责聚焦维度选择）、检查员（负责逐维度定量分析）和裁决者（负责聚合并优化最终判断）。特别地，我们引入一种智能体推理策略，赋予模型识别和优先考虑最相关质量维度的能力，并进一步提出一个配备外部分析工具的智能体工作流，以实现对选定维度的精确定量比较。在提出的基准和11个真实世界数据集上的实验表明，我们的框架不仅显著提升了LLM在质量理解和定量比较方面的能力，而且有效地将这些改进转化为更好的质量感知数据选择，从而提升下游性能和数据效率。

English abstract

Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensions. Recently, large language models (LLMs) have emerged as a promising paradigm for TS quality assessment via pairwise comparison and per-dimension evaluation. However, existing approaches rely on manually predefined quality dimensions and purely text-based reasoning, leaving it unknown whether LLMs can identify truly relevant quality dimensions or perform grounded and quantitative quality comparisons. To investigate this, we construct TSQBench, a dedicated benchmark for evaluating LLMs on two progressive capabilities: (i) understanding and identifying relevant quality dimensions, and (ii) performing quality comparison under specific dimensions. Our analysis reveals that current LLMs consistently struggle with both dimension identification and evidence-grounded quality comparison. To address these limitations, we propose TSQAgent, a novel agentic reasoning framework for TS quality rating consisting of three collaborative roles: Perceiver for focused dimension selection, Inspector for dimension-wise quantitative analysis, and Adjudicator that aggregates and refines the final judgment. In particular, we introduce an agentic reasoning strategy that instills the ability to identify and prioritize the most relevant quality dimensions, and further propose an agent workflow equipped with external analytical tools to enable precise quantitative comparisons over selected dimensions. Experiments on both the proposed benchmark and eleven real-world datasets demonstrate that our framework not only substantially improves LLMs' capabilities in quality understanding and quantitative comparison but also effectively translates these improvements into better quality-aware data selection, leading to enhanced downstream performance and data efficiency.

2606.03628 2026-06-03 cs.CL cs.AI cs.LG

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

通过幻觉拒绝采样构建可靠的长文本生成

Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal

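Affiliations: Georgia Institute of Technology; University of California, Berkeley; University of Cambridge; DeepMind

AI summary: Proposes SHARS, a segment-wise hallucination rejection sampling framework that uses an arbitrary hallucination detector to reject and resample hallucinated segments during generation, mitigating hallucination accumulation in long-form generation and improving factual consistency.

Comments: Accepted by ICML 2026

AI Chinese abstract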

大型语言模型（LLMs）在开放式文本生成方面取得了显著进展，但仍容易产生不正确或无依据的幻觉内容，这损害了其可靠性。在长文本生成中，由于幻觉雪崩现象（早期错误传播并累积到后续输出），这一问题更加严重。为了解决这一挑战，我们提出了一种新颖的推理时幻觉缓解框架，称为分段幻觉拒绝采样（SHARS），该框架使用任意幻觉检测器在生成过程中识别并拒绝幻觉片段，并重新采样直到生成忠实的内容。通过仅保留可信信息并在此基础上构建后续生成，该框架减轻了幻觉累积并增强了事实一致性。为了实例化该框架，我们采用语义不确定性作为检测器，并引入了若干关键修改以解决其局限性并更好地适应长文本。我们的方法使模型能够自我纠正幻觉，无需外部资源（如网络搜索或知识库），同时保持与这些资源的兼容性以便未来扩展。在标准化幻觉基准上的实证评估表明，我们的方法显著减少了长文本生成中的幻觉，同时保持甚至提高了生成的信息量。代码可在以下网址获取：https://github.com/TreeLLi/hallucination-rejection-sampling。

English abstract

Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.
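The control flow described above can be sketched as a loop that proposes one segment at a time, asks a detector whether it looks hallucinated, and resamples on rejection. Everything named here (`generate_segment`, `is_hallucinated`, the retry budget, and stopping when the budget runs out) is a hypothetical stand-in; the paper's detector is based on semantic uncertainty and its exact procedure may differ.

```python
def generate_with_rejection(generate_segment, is_hallucinated,
                            max_segments=10, max_retries=5):
    """Build a text segment by segment, keeping only segments the detector accepts."""
    accepted = []
    for _ in range(max_segments):
        for _ in range(max_retries):
            candidate = generate_segment(accepted)
            if candidate is None:      # generator signals end of text
                return accepted
            if not is_hallucinated(accepted, candidate):
                accepted.append(candidate)  # later segments condition on this one
                break
        else:
            # Assumption: stop generating if every retry was rejected.
            break
    return accepted


def make_toy_generator(candidates):
    # Toy stand-in for an LLM: returns the next canned candidate, ignoring the prefix.
    stream = iter(candidates)
    return lambda prefix: next(stream, None)


def toy_detector(prefix, segment):
    # Toy stand-in for a hallucination detector: flags segments ending in "?".
    return segment.endswith("?")


toy_generator = make_toy_generator(["A.", "B?", "B.", "C."])
print(generate_with_rejection(toy_generator, toy_detector, max_segments=3))
```

Because each accepted segment becomes part of the prefix for the next proposal, a rejected segment never influences later text, which is the property the abstract credits with limiting snowballing.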

2606.03626 2026-06-03 cs.CV cs.AI cs.CY

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Chao Wen, Jacqueline Staub, Adish Singla

Affiliations: MPI-SWS (Max Planck Institute for Software Systems)

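AI summary: Introduces TurtleAI, a benchmark of 823 visual programming tasks curated from real-world Turtle Graphics tasks. An evaluation of more than 20 multimodal models finds that most achieve success rates below 30%, and fine-tuning Qwen2-VL-72B on synthetic data generated from a small set of seed samples improves performance on real-world tasks by about 20%.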

Comments: ACL Findings 2026 paper

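AI abstract (translated from Chinese)

Vision-language models (VLMs) have been explored for visual programming, that is, generating code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and which factors limit their performance. To fill this gap, we introduce TurtleAI, a benchmark of 823 tasks curated from real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires a model to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces the patterns. We evaluate more than 20 VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle considerably, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that needs only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis shows that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning mainly improves the alignment between visual reasoning and code implementation.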

English abstract

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.
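To make the task type concrete, the sketch below shows what "Python code that reproduces a geometric pattern" can look like and one possible way to check two programs against each other. It is an illustration under stated assumptions, not the benchmark's evaluation protocol, which the abstract does not describe: `MiniTurtle` is a hypothetical headless stand-in that mirrors the `forward`, `left`, and `right` method names of Python's standard `turtle` module, and comparing rounded line-segment sets is an assumed notion of "faithful reproduction".

```python
import math
from typing import FrozenSet, List, Set, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


class MiniTurtle:
    """Hypothetical headless turtle: records line segments instead of drawing."""

    def __init__(self) -> None:
        self.x, self.y = 0.0, 0.0
        self.heading = 0.0  # degrees, 0 = east, counterclockwise positive
        self.segments: List[Segment] = []

    def forward(self, distance: float) -> None:
        rad = math.radians(self.heading)
        nx = self.x + distance * math.cos(rad)
        ny = self.y + distance * math.sin(rad)
        self.segments.append(((self.x, self.y), (nx, ny)))
        self.x, self.y = nx, ny

    def left(self, angle: float) -> None:
        self.heading = (self.heading + angle) % 360

    def right(self, angle: float) -> None:
        self.heading = (self.heading - angle) % 360


def normalize(segments: List[Segment], ndigits: int = 6) -> Set[FrozenSet[Point]]:
    """Round coordinates and ignore the direction and order of the strokes."""

    def rounded(p: Point) -> Point:
        return (round(p[0], ndigits), round(p[1], ndigits))

    return {frozenset((rounded(a), rounded(b))) for a, b in segments}


def reference_square(t: MiniTurtle, side: float) -> None:
    # Counterclockwise square starting along the x-axis.
    for _ in range(4):
        t.forward(side)
        t.left(90)


def candidate_square(t: MiniTurtle, side: float) -> None:
    # Same square traced clockwise, starting along the y-axis.
    t.left(90)
    for _ in range(4):
        t.forward(side)
        t.right(90)


def candidate_triangle(t: MiniTurtle, side: float) -> None:
    for _ in range(3):
        t.forward(side)
        t.left(120)


ref, cand, tri = MiniTurtle(), MiniTurtle(), MiniTurtle()
reference_square(ref, 10)
candidate_square(cand, 10)
candidate_triangle(tri, 10)

same_pattern = normalize(ref.segments) == normalize(cand.segments)
different_pattern = normalize(ref.segments) != normalize(tri.segments)
print(same_pattern, different_pattern)
```

The two square programs differ in stroke order and turning direction yet trace the same four edges, which is why the comparison is done on normalized segment sets rather than on the command sequence.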

2606.03624 2026-06-03 cs.AI cs.CL

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu

Affiliations: The Chinese University of Hong Kong; University of International Relations; Tencent Jarvis Lab; Westlake University; King’s College London

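AI summary: To address the difficulty large reasoning models have in reliably following multiple constraints, the paper proposes the Constraint Relationship Graph Completion (CRGC) framework, which explicitly models relationships between constraints and discovers bridge constraints, reducing constraint violations by 39% compared to standard prompting.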

Comments: a pre-MIT Press publication version

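AI abstract (translated from Chinese)

Large Reasoning Models (LRMs) show impressive capabilities on many tasks, but they have difficulty reliably following multiple instructions: they either fail to satisfy individual constraints or struggle to balance competing constraints at the same time. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models the relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make the primary constraints more salient and compatible. Unlike existing approaches that strengthen instruction following through general training methods, CRGC specifically improves constraint satisfaction by using the model's own knowledge to create better generation pathways. Experiments on three popular instruction-following datasets show that, compared to standard prompting, our approach reduces constraint violations by 39% while preserving the reasoning abilities of large reasoning models.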

English abstract

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining reasoning abilities of large reasoning models.
