arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27167 2026-05-27 cs.RO

TCBiRRT: Rapid Motion Planning for Tightly Coupled Dual-arm Space Manipulator Using Task-space Random Expansion

Jiawei Zhang, Xinhao Miao, Jifeng Guo, Qinghua Li, Chengchao Bai

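AI summary For motion planning of tightly coupled dual-arm space manipulators under closed-chain constraints, the paper proposes a task-space constrained bidirectional rapidly-exploring random tree algorithm (TCBiRRT). It samples and expands nodes directly in task space and combines path inverse kinematics mapping with a regrasp mechanism, markedly improving planning success rate and speed.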

Comments 12 pages, 9 figures

Abstract

Planning the motion path for a tightly coupled dual-arm space manipulator under closed-chain constraints is a fundamental yet challenging problem in on-orbit assembly of large-scale space structures. The closed-chain constraints significantly reduce the feasible configuration space, making it difficult for existing planners to efficiently generate collision-free motions, especially in cluttered environments. To address this issue, this paper proposes a task-space constrained bidirectional rapidly-exploring random tree algorithm, termed TCBiRRT. Unlike conventional methods that operate in the high-dimensional configuration space, the proposed approach performs random sampling and node expansion directly in the task space defined by the manipulated object pose. A task-space node expansion strategy is developed to generate candidate object motions, which are then mapped to continuous joint paths using a path inverse kinematics algorithm. The method is further integrated with a bidirectional RRT framework and a regrasp mechanism to efficiently connect two random trees. Extensive simulations are conducted in representative on-orbit assembly scenarios with varying levels of environmental complexity. The results demonstrate that TCBiRRT achieves significantly higher success rates and orders-of-magnitude improvements in planning time compared to state-of-the-art planners. The proposed method provides an efficient and robust solution for motion planning of tightly coupled dual-arm space manipulators.
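The task-space expansion step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the planar pose `(x, y, yaw)`, the rotation weight `w_rot`, the step size, and the `is_feasible` callback (standing in for the path inverse kinematics and collision checks) are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, Optional, Tuple

# Hypothetical planar object pose (x, y, yaw); the paper's task space is the full object pose.
Pose = Tuple[float, float, float]


def wrap_angle(a: float) -> float:
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def pose_distance(a: Pose, b: Pose, w_rot: float = 0.5) -> float:
    """Translation distance plus weighted rotation distance (w_rot is an assumed weight)."""
    return math.hypot(b[0] - a[0], b[1] - a[1]) + w_rot * abs(wrap_angle(b[2] - a[2]))


def steer(near: Pose, target: Pose, step: float) -> Pose:
    """Move from `near` toward `target` by at most `step` in task space."""
    d = pose_distance(near, target)
    if d <= step:
        return target
    t = step / d
    return (
        near[0] + t * (target[0] - near[0]),
        near[1] + t * (target[1] - near[1]),
        near[2] + t * wrap_angle(target[2] - near[2]),
    )


def extend(
    tree: Dict[Pose, Optional[Pose]],
    sample: Pose,
    is_feasible: Callable[[Pose, Pose], bool],
    step: float = 0.1,
) -> Optional[Pose]:
    """One RRT expansion; `tree` maps each node to its parent."""
    near = min(tree, key=lambda n: pose_distance(n, sample))
    new = steer(near, sample, step)
    # Assumed hook: a real planner would map near -> new to joint paths and check collisions.
    if is_feasible(near, new):
        tree[new] = near
        return new
    return None


tree: Dict[Pose, Optional[Pose]] = {(0.0, 0.0, 0.0): None}
new_node = extend(tree, (1.0, 0.0, 0.0), lambda a, b: True, step=0.1)
```

A bidirectional variant would keep one such tree rooted at the start pose and one at the goal pose, alternating `extend` calls until the two trees can be connected.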

2605.27164 2026-05-27 cs.AI

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales, Cristina Cornelio

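AI summary Proposes the DualGraph framework, which combines semantic retrieval and symbolic querying over semi-structured documents through two views, a textual knowledge graph and a symbolic knowledge graph, and outperforms existing methods on the SpecsQA benchmark.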

Abstract

Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject-predicate-object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence. We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types. Code and data are available at https://github.com/corneliocristina/DualGraphRAG.
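To make concrete why exact filtering favors a symbolic view, here is a small Python sketch of a query over typed subject-predicate-object triples. The `Triple` type, the product names, and the `ram_gb` predicate are hypothetical; nothing here is taken from the DualGraph code base.

```python
from typing import Callable, List, NamedTuple, Union

Value = Union[str, float]


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: Value


# Hypothetical product facts, as they might be extracted from semi-structured pages.
TRIPLES = [
    Triple("laptop-a", "ram_gb", 16.0),
    Triple("laptop-b", "ram_gb", 8.0),
    Triple("laptop-c", "ram_gb", 32.0),
    Triple("laptop-a", "brand", "Acme"),
]


def subjects_where(
    triples: List[Triple], predicate: str, test: Callable[[Value], bool]
) -> List[str]:
    """Exhaustive, exact filter: every subject whose value for `predicate` passes `test`."""
    return sorted(t.subject for t in triples if t.predicate == predicate and test(t.obj))


# "Which laptops have at least 16 GB of RAM?" is answered exactly and completely.
at_least_16 = subjects_where(TRIPLES, "ram_gb", lambda v: v >= 16)
```

A similarity-based retriever returns only the top-k chunks, so it can miss matching products; the filter above enumerates all of them, which is the kind of operation the abstract assigns to the symbolic graph.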

2605.27163 2026-05-27 cs.LG stat.ML

The Role of Causal Features in Strategic Classification for Robustness and Alignment

Antonio Gois, Sophia Gunluk, Nir Rosenfeld, Nidhi Hegde, Simon Lacoste-Julien, Dhanya Sridhar

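AI summary Analyzes distribution shift in strategic classification through causal models, proves that causal classification attains optimal error when the noise is bounded, decomposes the OOD cross-entropy risk, and shows the advantage of causal features for long-term incentive alignment.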

Comments Accepted at AISTATS 2026. 20 pages, 5 figures

Abstract

In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show that the OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to understand when causal classifiers have an advantage. Finally, we show that the use of causal features can allow alignment of long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.

2605.27161 2026-05-27 cs.CL

Formalization of Malagasy conjugation

Joro Ny Aina Ranaivoarison, Eric Laporte, Baholisoa Simone Ralalaoherivony

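AI summary Using the Unitex platform, the paper builds an electronic dictionary and finite-state transducers to perform morphological analysis of Malagasy simple verbs, prioritizing readability so that linguists can extend and update the resources.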

Journal ref
Language and Technology Conference, 2013, Poznań, Poland, pp.457-462

Abstract

This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. It uses the Unitex platform and comprised the construction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.

2605.27158 2026-05-27 cs.CV

Model discovery for dynamical systems with complex-valued product units

Martin Brückmann, Babette Dellen, Uwe Jaekel

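AI summary Proposes a data-driven method based on complex-valued product-unit networks that learns sparse linear combinations of monomials, including those with fractional or negative exponents, directly from observed trajectories, thereby discovering the governing equations of dynamical systems.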

Comments 16 pages, 8 figures

Abstract

Discovering the governing equations of a dynamical system from observed trajectories provides deeper insight into its structure than mere prediction of future states. We present a data-driven approach to model discovery based on complex-valued product-unit networks, in which each unit represents a complex monomial and the network output is a sparse linear combination of such monomials. In contrast to established library-based methods such as SINDy, our approach does not require a predefined set of candidate functions: the relevant monomials, including those with fractional or negative exponents, are learned directly from data. Across four chaotic benchmark systems (Lorenz63, Lorenz84, the Four-Wing attractor, and a fractional variant of Lorenz63), we recover the exact governing equations in 90% of trials for the first three systems, and in 70-90% of trials for the fractional case, using at least 3000 training points. Applied to real-world human-gait accelerometer signals, the model produced stable trajectories with bounded prediction errors, corresponding to an RMSE of approximately 12-14% of the signal amplitude range over a test horizon three times longer than the training interval, demonstrating its potential for high-dimensional systems in which analytic equations are unavailable.
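A product unit computes a monomial whose exponents are the learnable weights. The Python sketch below shows the standard way to evaluate such a unit through the complex logarithm, which is what lets negative inputs and fractional or negative exponents coexist. The function name and the example values are illustrative assumptions, not the authors' code.

```python
import cmath
from typing import Sequence


def product_unit(inputs: Sequence[float], exponents: Sequence[float]) -> complex:
    """Evaluate prod_i x_i ** w_i as exp(sum_i w_i * log(x_i)) with the complex log.

    The complex logarithm is defined for negative x_i, so the unit stays
    well defined there; x_i == 0 is not supported (log(0) raises ValueError).
    """
    log_sum = sum(w * cmath.log(complex(x)) for x, w in zip(inputs, exponents))
    return cmath.exp(log_sum)


# x1 * x2 ** 0.5 with x1 = 2, x2 = 4
monomial = product_unit([2.0, 4.0], [1.0, 0.5])
# (-2) ** 2: a negative input with an integer exponent gives a real result
squared = product_unit([-2.0], [2.0])
# 4 ** -0.5: a negative fractional exponent
inverse_root = product_unit([4.0], [-0.5])
```

A network output in the sense of the abstract would then be a sparse weighted sum of several such units, with both the coefficients and the exponents fitted to the observed trajectories.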

2605.27157 2026-05-27 cs.AI

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

Zhe Yu, Wenpeng Xing, Chen Ye, Xuyang Teng, Bo Yang, Changting Lin, Meng Han

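AI summary Using a multi-turn document accumulation protocol, the paper finds a monitoring-control gap in retrieval-augmented LLMs: models can recognize contradictory evidence but fail to safely constrain their final recommendations. The mechanism analysis points to a deficit in action selection.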

Abstract

Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

2605.27156 2026-05-27 cs.CL cs.AI

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

Ruikang Zhang, Zhanni Chen, Yiqiao Cai, Qi Su

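AI summary Proposes LitSeg, a narrative-theory-guided document segmentation framework that uses multi-stage prompting to extract events, untangle narrative threads, and locate turning points, addressing the loss of retrieval and generation performance caused by segmentation methods that ignore literary narrative structure. A lightweight variant, LitSeg-Lite, reduces computational overhead through data distillation.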

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.

2605.27154 2026-05-27 cs.CV

Touch-R1: Reinforcing Touch Reasoning in MLLMs

Yingxin Lai, Yafei Zhou, Fucai Zhu, Siyu Zhu, Weihao Yuan

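AI summary To address the ordinal nature of physical attributes and cross-sensor distribution shift in tactile reasoning, the paper proposes Touch-R1, trained with a tactile-grounded GRPO objective, which on average outperforms Octopi-13B and GPT-4o on TouchReason-Bench.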

Comments Our code and data will be made public at https://laiyingxin2.github.io/Projects

Abstract

While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4% and GPT-4o by 24.7% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
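Two of the reward terms named in this abstract can be sketched in a few lines of Python. The abstract does not give formulas, so the definitions below are assumptions: ordinal accuracy is taken as one minus the normalized distance between predicted and true levels, and the tactile-use reward pays out only when the real-input score beats every counterfactual control.

```python
from typing import Sequence


def ordinal_accuracy(pred_level: int, true_level: int, num_levels: int) -> float:
    """Assumed ordinal-aware score: near misses on an ordered scale earn partial credit."""
    return 1.0 - abs(pred_level - true_level) / (num_levels - 1)


def tactile_use_reward(
    score_with_touch: float, control_scores: Sequence[float], margin: float = 0.0
) -> float:
    """Assumed gating rule: credit only if the authentic tactile input beats all controls.

    `control_scores` holds correctness under the removed, shuffled, and
    noise-masked tactile streams; `margin` is a hypothetical slack parameter.
    """
    beats_all = all(score_with_touch > c + margin for c in control_scores)
    return 1.0 if beats_all else 0.0


near_miss = ordinal_accuracy(pred_level=2, true_level=3, num_levels=5)
# Correct with touch, wrong under all three controls: touch was actually used.
grounded = tactile_use_reward(1.0, [0.0, 0.0, 0.0])
# Also correct with the tactile stream shuffled: the answer did not depend on touch.
ungrounded = tactile_use_reward(1.0, [0.0, 1.0, 0.0])
```

Whether the paper compares against each control separately or against an average is not stated in the abstract; the "beats all" rule is simply the strictest reading.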

2605.27146 2026-05-27 cs.CV

Chaos-SSL: An Attention-Based Self-Supervised Learning Framework with Chaotic Transformation for Medical Image Classification

Joao Batista Florindo

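AI summary Proposes the Chaos-SSL framework, which uses 1D chaotic maps as non-linear data augmentation for self-supervised pre-training and combines it with an attention-based fusion model, achieving performance competitive with state-of-the-art methods on skin lesion and diabetic retinopathy classification.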

Journal ref
In Proceedings of VISAPP 2026 - Volume 1, pages 574-581

Abstract

Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create "harder" and more semantically rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state-of-the-art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
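The three maps named in this abstract have standard textbook definitions on the unit interval. The Python sketch below implements them and applies one to normalized pixel intensities; how the paper actually applies the maps to images is not described in the abstract, so the per-pixel iteration in `chaotic_augment` and the default parameters are assumptions.

```python
import math
from typing import Callable, List


def logistic_map(x: float, r: float = 4.0) -> float:
    """Logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def tent_map(x: float, mu: float = 2.0) -> float:
    """Tent map x -> mu * min(x, 1 - x)."""
    return mu * min(x, 1.0 - x)


def sine_map(x: float, r: float = 1.0) -> float:
    """Sine map x -> r * sin(pi * x)."""
    return r * math.sin(math.pi * x)


def chaotic_augment(
    image: List[List[float]], chaotic_map: Callable[[float], float], n_iter: int = 1
) -> List[List[float]]:
    """Hypothetical augmentation: iterate the map on each intensity in [0, 1]."""
    out = []
    for row in image:
        new_row = []
        for v in row:
            for _ in range(n_iter):
                v = chaotic_map(v)
            new_row.append(min(max(v, 0.0), 1.0))  # clamp against rounding drift
        out.append(new_row)
    return out


view = chaotic_augment([[0.25, 0.5]], tent_map, n_iter=1)
```

In a contrastive setup, two such transformed views of the same image would form a positive pair. With `mu = 2.0`, long floating-point iteration of the tent map collapses to zero, so small iteration counts or a slightly smaller `mu` are the safer choice.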

2605.27144 2026-05-27 cs.CV cs.LG

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb

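AI summary Proposes the Superpixel Transformers (SPT) framework, which unifies superpixel-based image classification and Vision Transformers. With a multidimensional sine-cosine positional encoding and an enriched patch data structure, it outperforms superpixel-based GNN methods on several datasets and is competitive with Vision Transformers.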

Abstract

Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth $16\times16$ superpixels.
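Because superpixels have no grid index, a positional encoding has to work on continuous coordinates such as a centroid. One common way to build a multidimensional sine-cosine encoding is to apply the usual sinusoidal scheme to each axis and concatenate the results, sketched below in Python. This is an assumed construction for illustration; the abstract does not specify the exact formula used in SPT.

```python
import math
from typing import List, Sequence


def sincos_1d(pos: float, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one scalar position into `dim` features (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for i in range(dim // 2):
        freq = 1.0 / base ** (2 * i / dim)
        features += [math.sin(pos * freq), math.cos(pos * freq)]
    return features


def sincos_nd(coords: Sequence[float], dim_per_axis: int) -> List[float]:
    """Concatenate per-axis encodings, e.g. for a superpixel centroid (row, col)."""
    encoding: List[float] = []
    for c in coords:
        encoding += sincos_1d(c, dim_per_axis)
    return encoding


# Hypothetical centroid of one superpixel, in pixel units
centroid_encoding = sincos_nd((3.5, 7.25), dim_per_axis=8)
```

The same function extends to extra axes (for example, a scale or area coordinate) by passing a longer `coords` tuple.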

2605.27141 2026-05-27 cs.AI

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

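AI summary Because existing agent benchmarks overlook the inference and use of user preferences, the paper proposes VitaBench 2.0, which evaluates the personalization and proactiveness of agents in long-term interactions through temporally ordered tasks and an extensible memory interface. Experiments show that state-of-the-art models still struggle.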

Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

2605.27140 2026-05-27 cs.AI

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

Yanfei Zhang, Xu Lin, Chenglin Wu

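AI summary Proposes the StepOPSD framework, which takes the agent step as the unit of credit assignment, rescores step segments under hindsight-enriched teacher contexts, and applies advantage shaping with a normalized per-step credit budget before the GRPO update, addressing the credit-assignment mismatch in multi-turn agent reinforcement learning.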

Abstract

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.

2605.27138 2026-05-27 cs.AI

ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules

Ruihao Pan, Suhang Wang

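AI summary Proposes the ICCU framework, which induces readable refusal rules from forget data and applies them at inference time without modifying model parameters, enabling efficient, interference-free continual machine unlearning.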

Abstract

Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning requests often arrive sequentially, which challenges existing fine-tuning-based methods: fine-tuning each request is costly, accumulates utility loss, and may cause cross-request interference. To address these issues, we propose ICCU (In-Context Continual Unlearning), an in-context continual unlearning framework that induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Extensive experiments show that ICCU effectively suppresses target knowledge while preserving utility, scales across sequential requests, and remains robust to paraphrased and cross-lingual queries.
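The order-independence claim follows from storing rules as a set union, which the Python sketch below makes explicit. Representing a rule as a regular expression is a simplifying assumption (the paper's rules are described only as "readable"), and the patterns, the refusal text, and the `model` callback are hypothetical.

```python
import re
from typing import Callable, Iterable, Set

REFUSAL = "I can't help with that."


def merge_rules(*rule_sets: Iterable[str]) -> Set[str]:
    """Accumulate rules from sequential unlearning requests as a plain set union."""
    merged: Set[str] = set()
    for rules in rule_sets:
        merged |= set(rules)
    return merged


def should_refuse(query: str, rules: Set[str]) -> bool:
    """Inference-time filter: refuse if any rule pattern matches the query."""
    return any(re.search(pattern, query, flags=re.IGNORECASE) for pattern in rules)


def answer(query: str, rules: Set[str], model: Callable[[str], str]) -> str:
    """Wrap an unmodified model; its parameters are never touched."""
    return REFUSAL if should_refuse(query, rules) else model(query)


# Two hypothetical unlearning requests arriving at different times
rules_a = {r"\bproject\s+zephyr\b"}
rules_b = {r"\bjane\s+doe\b"}
active_rules = merge_rules(rules_a, rules_b)

blocked = answer("Who is Jane Doe?", active_rules, lambda q: "model output")
allowed = answer("What is RAG?", active_rules, lambda q: "model output")
```

Since set union is commutative and associative, `merge_rules(rules_a, rules_b)` equals `merge_rules(rules_b, rules_a)`, so a later request cannot disturb an earlier one. Plain regular expressions would not by themselves give the paraphrase and cross-lingual robustness the abstract reports.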

2605.27136 2026-05-27 cs.CV

Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation

Joseph Hoche, David Brellmann, Gianni Franchi

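AI summary To address the under-use of visual information in uncertainty quantification for large vision-language models, the paper proposes VIG-TUQ, a visually grounded token-level uncertainty quantification framework that weights language uncertainty with visual grounding scores and improves uncertainty estimation without training.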

Abstract

Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.
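The weighting idea can be sketched in Python as a grounding-weighted average of per-token entropies. This is only one plausible reading: the abstract does not say how grounding scores are computed or how they enter the final estimate, so the scores below are made-up inputs and the weighted mean is an assumption.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def weighted_uncertainty(
    token_uncertainties: Sequence[float], grounding_scores: Sequence[float]
) -> float:
    """Assumed aggregation: grounding-weighted mean of token-level uncertainties."""
    total_weight = sum(grounding_scores)
    if total_weight == 0.0:
        raise ValueError("grounding scores must not all be zero")
    weighted_sum = sum(u * g for u, g in zip(token_uncertainties, grounding_scores))
    return weighted_sum / total_weight


coin_flip = token_entropy([0.5, 0.5])
# Equal grounding reduces to the plain mean of the token uncertainties.
uniform = weighted_uncertainty([1.0, 3.0], [1.0, 1.0])
# Up-weighting the first token pulls the estimate toward its uncertainty.
skewed = weighted_uncertainty([1.0, 3.0], [3.0, 1.0])
```

Because the scheme only rescales quantities already available at generation time, it needs no additional training, which matches the training-free claim in the abstract.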

2605.27134 2026-05-27 cs.AI

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu, Jian Luan

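AI summary Systematically studies data scaling, benchmarking, and reasoning for vision-language models in mobile GUI navigation, introduces the large-scale dataset HyperTrack and the open-source toolkit GUIEvalKit, and finds that reinforcement-based finetuning outperforms supervised finetuning, especially in out-of-domain settings.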

Comments Accepted at ICML 2026

Abstract

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

2605.27133 2026-05-27 cs.LG cs.AI

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

Xuan Lin, Chunlin Wu

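AI summary This paper studies the training problem of the basic forward-backward-splitting (FBS) induced network, proves that it converges to the learning problem of the deep-layer limit system, and gives a perturbation stability analysis.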

Comments 38 pages, 1 figure

English abstract

Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs) have attracted much attention in data science over the last decade. Therein, numerous important network architectures were constructed from the basic forward-backward-splitting (FBS) algorithm. In this paper, we continue our research on the most basic FBS-induced network, an architecture unrolled from the original FBS algorithm by incorporating direct parameter relaxations. Following the difference/differential inclusion formulations in our previous forward system analyses, we here consider some theoretical aspects of corresponding learning problems. Under some mild assumptions, we establish a general convergence property of the training problem of the basic FBS-induced network to the learning problem of the deep-layer limit system, implying a $\Gamma$-convergence argument showing that any cluster point of the optimal learning parameters for the network is a solution to the learning problem of the deep-layer limit system. A qualitative analysis of perturbation stabilities of these learning problems is also presented. A simple numerical experiment is conducted to validate our main general convergence result.

2605.27132 2026-05-27 cs.CV

Image Thresholding: Understanding Bias of Evaluation Metrics towards Specific Evaluation Functions

Eslam Hegazy, Mohamed Gabr

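AI summary By analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds on the BSDS500 dataset, this paper shows that Otsu's criterion correlates highly with SSIM and PSNR while Kapur's entropy correlates weakly, indicating an inherent metric-objective-function bias.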

Comments Submitted to ICPR 2026 (https://icpr2026.org)

English abstract

Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu's between-class variance and Kapur's entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu's criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur's entropy demonstrates weaker and more variable correlation. Otsu's criterion correlates more strongly than Kapur's entropy with PSNR on all images and with SSIM on over 91% of them. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. The source code for this paper is available at https://w3id.org/met-dp/icpr26-95
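
One way to see why Otsu's criterion and PSNR can move together is the small sketch below. It is an illustration under stated assumptions, not the paper's pipeline: it handles only a single threshold, works on a toy histogram, and assumes the thresholded image is rebuilt by replacing each pixel with the mean of its class. Under that assumption the mean squared error equals the total variance minus Otsu's between-class variance, so PSNR rises whenever Otsu's score rises. The names `otsu_and_psnr` and `pearson` are hypothetical helpers introduced here.

```python
import math

def otsu_and_psnr(hist, max_value=255):
    """For every threshold t, return Otsu's between-class variance and the
    PSNR of a class-mean reconstruction (assumed convention, see lead-in)."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_sq = sum(i * i * h for i, h in enumerate(hist))
    total_var = sum_sq / total - (sum_all / total) ** 2
    otsu, psnr = [], []
    n0 = 0  # pixel count in class 0 (levels <= t)
    s0 = 0  # sum of gray levels in class 0
    for t in range(len(hist) - 1):
        n0 += hist[t]
        s0 += t * hist[t]
        n1 = total - n0
        if n0 == 0 or n1 == 0:
            between = 0.0  # one class is empty, no separation
        else:
            mu0 = s0 / n0
            mu1 = (sum_all - s0) / n1
            between = (n0 / total) * (n1 / total) * (mu0 - mu1) ** 2
        # Within-class variance is the MSE of the class-mean reconstruction.
        mse = max(total_var - between, 1e-12)
        otsu.append(between)
        psnr.append(10 * math.log10(max_value ** 2 / mse))
    return otsu, psnr

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy bimodal histogram with 8 gray levels (made-up data).
toy_hist = [10, 30, 10, 0, 0, 10, 30, 10]
otsu_scores, psnr_scores = otsu_and_psnr(toy_hist, max_value=7)
correlation = pearson(otsu_scores, psnr_scores)
```

In this simplified setting the threshold that maximizes Otsu's score also maximizes PSNR by construction, which is the kind of built-in agreement the abstract describes as a metric-objective-function bias. Kapur's entropy has no such algebraic tie to the squared error.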

2605.27130 2026-05-27 cs.LG cs.AI

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

John Donaghy, Shikhar Rastogi

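AI summary Proposes the DEI framework, which runs distributed Quality-Diversity search with heterogeneous large language models as mutation operators; experiments indicate that model diversity improves search performance more than parallelism does.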

Comments Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

English abstract

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model's inductive biases across all workers, DEI treats each LLM's distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round's population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves 124 percent higher merged-archive QD-Score (45.90 vs. 20.46) and 28 percent higher coverage (80.6 percent vs. 63.0 percent of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
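
The two headline percentages in this abstract are relative gains over the single-node baseline, which a short check makes explicit. The helper name `relative_gain_percent` is hypothetical; the four input numbers are taken from the abstract.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0

# Merged-archive QD-Score: heterogeneous ensemble vs. single-node baseline.
qd_gain = relative_gain_percent(45.90, 20.46)

# Coverage, given as percent of cells filled.
coverage_gain = relative_gain_percent(80.6, 63.0)

# Absolute coverage difference in percentage points, for comparison.
coverage_pp = 80.6 - 63.0
```

Rounded, these reproduce the reported 124 percent and 28 percent. The coverage figure is a relative gain; the absolute difference is about 17.6 percentage points.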

2605.27129 2026-05-27 cs.CV cs.RO

YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting

Rajmeet Singh, Manveen Kaur, Shahpour Alirezaee, Irfan Hussain

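AI summary Proposes YOLO26-RipeLoc Lite, a lightweight YOLO26-based architecture that uses a lightweight feature pyramid network, a ripeness-aware attention module, and a compact detection head to perform ripeness classification and center-point localization of greenhouse tomatoes, reaching 92.9% mAP@0.5 with only 2.38M parameters.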

English abstract

In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) with the highest precision (95.2%) among all evaluated architectures using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@50), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@50:95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm superior accuracy-efficiency: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters and integrated center-point localization for robotic end-effector guidance.

2605.27128 2026-05-27 cs.CV cs.LG

PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance

Yujing Zhou, Prashant Shekhar, Thomas Yang, Yongxin Liu

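AI summary Proposes the PILOT framework, which freezes the original network parameters and adds a parallel Derivative branch to capture boundary information of novel classes, enabling incremental learning for real-time semantic segmentation models without old data and effectively mitigating catastrophic forgetting.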

English abstract

Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real-world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. In this regard, the standard fine-tuning methods in deep learning often fail due to catastrophic forgetting, where the model learns new information but forgets previously learned classes. Contributing to this crucial domain, the current paper proposes a novel continual learning framework tailored for PIDNet, which is a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT (Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high-frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.

2605.27117 2026-05-27 cs.AI

Position: AI Safety Requires Effective Controllability

Yige Li, Yunhao Feng, Jun Sun

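AI summary This paper argues that AI safety should treat controllability as a first-class objective: it defines controllability, introduces the ControlBench benchmark, analyzes the shortcomings of existing alignment mechanisms, and proposes a control-centric architectural framework.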

Comments 23 pages

English abstract

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define controllability as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.

2605.27116 2026-05-27 cs.CV

COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection

Yupeng Zhang, Ruize Han, Yuzhong Feng, Zixin Ren, Yuntong Tian, Liang Wan

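AI summary Introduces COVD, a new task of continual open-vocabulary object detection, and injects novel concepts by freezing the visual encoder and updating only text-branch parameters, achieving efficient continual learning without additional parameters.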

English abstract

Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional parameters. Novel-114 and the code will be released.

2605.27115 2026-05-27 cs.AI

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang, Jian Liang, Han Li

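AI summary To address the recovery-preservation counteraction and weak-signal flattening that arise in multi-teacher on-policy distillation under incomplete prompt coverage, proposes CaMOPD, which uses decoupled alternating training and gap-based sample selection to recover general capabilities effectively while maintaining domain performance.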

English abstract

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.
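
The gap-based sample selection mentioned in this abstract can be pictured with a minimal sketch. Everything beyond "select samples with larger averaged token-level teacher-student log-probability gaps" is an assumption made for illustration: the gap is taken as the signed difference (teacher minus student), selection keeps a fixed top fraction, and the names `mean_token_gap`, `select_by_gap`, and `keep_fraction` are hypothetical. The paper may define and threshold the gap differently.

```python
def mean_token_gap(teacher_logprobs, student_logprobs):
    """Average per-token gap between teacher and student log-probabilities
    on the same student-generated token sequence."""
    gaps = [t - s for t, s in zip(teacher_logprobs, student_logprobs)]
    return sum(gaps) / len(gaps)

def select_by_gap(samples, keep_fraction=0.5):
    """Keep the samples with the largest averaged gap (assumed top-k rule)."""
    ranked = sorted(
        samples,
        key=lambda s: mean_token_gap(s["teacher"], s["student"]),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# Made-up log-probabilities for four student rollouts.
rollouts = [
    {"id": "a", "teacher": [-0.1, -0.2], "student": [-0.2, -0.3]},
    {"id": "b", "teacher": [-0.1, -0.1], "student": [-2.0, -1.0]},
    {"id": "c", "teacher": [-0.5], "student": [-0.6]},
    {"id": "d", "teacher": [-0.3, -0.3], "student": [-1.0, -1.2]},
]
selected_ids = [s["id"] for s in select_by_gap(rollouts, keep_fraction=0.5)]
```

Averaging uniformly over all four rollouts would dilute the two large gaps with two near-zero ones, which is the weak-signal flattening the abstract refers to. Filtering first concentrates the update on the samples that most need correction.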

2605.27113 2026-05-27 cs.LG cs.AI

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

Giuseppe Masi, Andrea Coletta, Novella Bartolini

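AI summary Proposes a quality-aware generative framework combining GANs and diffusion models, in which the GAN's Critic guides the diffusion process to generate more realistic synthetic data that preserves the stylized facts and inter-asset correlation structure of financial time series.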

English abstract

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN's Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.

2605.27101 2026-05-27 cs.CV cs.CL

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

Oscar Chew, Serhii Honcharenko, Qian-Hui Chen, Patricia Lu, Dishant Zaveri, Khoa D. Doan, Kuan-Hao Huang

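AI summary By inserting unrelated advertisement clips, the study finds that Video Large Language Models often wrongly associate events from different segments, exhibiting "bag-of-events" behavior in which a video is treated as a collection of events rather than a temporal sequence.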

English abstract

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.

2605.27097 2026-05-27 cs.LG stat.ML

Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias

James Town, Etienne Boursier, Ben Lewis, Matthias Englert, Ranko Lazic

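AI summary Studies the gradient flow dynamics of two-layer ReLU networks from small initialization on orthogonal data. It shows that the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, recovers the result that the network interpolates the training data with high probability once the width satisfies $m \gtrsim \log(n)$, and proves that the learned interpolator's squared $\ell_2$-norm scales as $\sqrt{n}$, within a constant factor of the minimal $\ell_2$-norm interpolator.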

Comments 66 pages, 6 figures

English abstract

The successful training of neural networks hinges on the use of first-order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as $m \gtrsim \log(n)$, where $m$ is the network width and $n$ is the number of training samples. This incremental process characterization also allows us to derive a novel implicit bias result: the learned interpolator has a squared $\ell_2$-norm scaling as $\sqrt{n}$, which is within a constant factor of the minimal $\ell_2$-norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, whilst suggesting that mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.

2605.27091 2026-05-27 cs.CL cs.AI

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

Anqi Hu, Zhiyuan Wang, Zijun Jia, Bo Fu

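AI summary Proposes MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure to achieve reliable set-valued prediction in open-ended question answering; it controls sampling risk and conditional selection risk while producing tighter bounds and more adaptive prediction sets.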

English abstract

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.
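
The decomposition named in this abstract follows from the law of total probability. The notation below is an assumption introduced for illustration, not the paper's: $A$ is the event that the sampled candidates contain at least one admissible answer, $C$ is the event that the final prediction set contains one, and $\alpha_{\mathrm{s}}$, $\alpha_{\mathrm{c}}$ are hypothetical target levels for the two stages. Because the prediction set is chosen from the sampled candidates, a sampling failure always implies miscoverage.

```latex
\begin{align}
\Pr(\neg C)
  &= \Pr(\neg A) + \Pr(A)\,\Pr(\neg C \mid A) \\
  &\le \underbrace{\Pr(\neg A)}_{\text{sampling failure}}
     + \underbrace{\Pr(\neg C \mid A)}_{\text{conditional selection failure}}
   \;\le\; \alpha_{\mathrm{s}} + \alpha_{\mathrm{c}} .
\end{align}
```

The last inequality holds whenever Stage I guarantees $\Pr(\neg A) \le \alpha_{\mathrm{s}}$ and Stage II guarantees $\Pr(\neg C \mid A) \le \alpha_{\mathrm{c}}$. The exact form of the paper's bounds may differ from this generic argument.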

2605.27088 2026-05-27 cs.CL cs.LG

LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

Unggi Lee, Minchul Shin, Yeil Jeong, Sookbun Lee, Jeongsu Moon, Kyungtae Joo, Eunjoo Lee, Hoilym Kwon

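AI summary Explores training-free optimization of the system prompt via API calls, proposes 5 education-specialized methods, and evaluates 12 methods on 2 OOD benchmarks; all methods surpass the strongest RL-trained baseline, and ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness.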

Comments 17 pages, 5 figures

English abstract

Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization, which evolves only the system prompt via API calls, can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.

2605.27083 2026-05-27 cs.CL cs.CR

On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning

Xiaotian Ye, Xiaohan Wang, Mengqi Zhang, Shu Wu

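AI summary This paper finds that counterfactual tuning (CFT) for LLM unlearning suffers from two major problems, knowledge conflict and hallucination spillover, and introduces the extended benchmark RWKU+ with diagnostic tools for systematic analysis.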

English abstract

Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. Our work further discusses the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.

2605.27082 2026-05-27 cs.AI

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

Qingyuan Zeng, Ziyang Chen, Pengxiang Cai, Zixin Guan, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

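AI summary Proposes SCENE, a bi-level multi-agent framework that uses iterative search to turn broad biomedical knowledge into evidence-supported, scenario-grounded propositions, and validates it in clinical trial and LINCS L1000 studies.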

English abstract

Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.