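arXivDaily: a daily academic digest synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.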

2512.13278 2026-06-08 cs.CL cs.LG version update

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang

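Affiliations: Nanyang Technological University

AI summary: Proposes the AutoTool framework, which gives large language models dynamic tool-selection capability through dual-phase optimization (SFT+RL trajectory stabilization and KL-regularized Plackett-Luce ranking), with average gains of 6.4%-7.7% on math, science, code, and multimodal reasoning tasks.

Comments ICML 2026; Best Paper Award at ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence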

Abstract

Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which limits the adaptability of LLM agents to new or evolving toolsets. We present AutoTool, a training framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. AutoTool employs a dual-phase optimization pipeline: (i) SFT and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce Ranking to refine consistent multi-step tool selection. We further build a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
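
The Plackett-Luce ranking mentioned in the abstract can be illustrated with a minimal sketch. It only computes the log-probability of one ordering of candidate tools from per-tool scores; the KL regularization term and everything else about AutoTool's training are not shown in the abstract and are omitted here. The scores and the function name `plackett_luce_log_prob` are hypothetical.

```python
import itertools
import math

def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` (item indices, best first) under Plackett-Luce."""
    remaining = list(ranking)
    log_prob = 0.0
    for item in ranking:
        # Probability of choosing `item` first among the items still remaining.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[item] - log_norm
        remaining.remove(item)
    return log_prob

scores = [2.0, 0.5, -1.0]  # hypothetical utilities for three candidate tools
best = plackett_luce_log_prob(scores, [0, 1, 2])   # ordering that follows the scores
worst = plackett_luce_log_prob(scores, [2, 1, 0])  # reversed ordering
# The probabilities of all orderings sum to one.
total = sum(math.exp(plackett_luce_log_prob(scores, list(p)))
            for p in itertools.permutations(range(len(scores))))
```

Under this model the ordering that follows the scores is the most likely one, which is why maximizing the log-probability of preferred orderings pushes scores toward a consistent ranking.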

2512.12997 2026-06-08 cs.CV cs.AI cs.LG version update

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

Wenjing Lu, Zerui Tao, Yuning Qiu, Dongping Zhang, Yang Yang, Qibin Zhao

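Affiliations: University of Science and Technology of China

AI summary: Addresses CLIP's vulnerability to adversarial attacks and poor uncertainty calibration in zero-shot classification by proposing an adversarial fine-tuning objective based on a Dirichlet reparameterization, which jointly aligns semantic structure and confidence and improves both calibration and robustness.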

Comments ICML 2026

Abstract

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and over-confidence. This reveals a critical reliability gap beyond robustness. To bridge this gap, we propose an adversarial fine-tuning objective for CLIP considering both accuracy and uncertainty. By reparameterizing CLIP outputs as the concentration parameters of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. This enables holistic distribution alignment under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments across multiple zero-shot benchmarks demonstrate that our method significantly improves uncertainty calibration and achieves competitive adversarial robustness while preserving clean accuracy.
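
One way to see how a Dirichlet view separates "relative semantic structure" from "confidence magnitude" is the sketch below. The abstract does not state the exact reparameterization, so the mapping alpha_k = exp(z_k) is an assumption made only for illustration, and `dirichlet_from_logits` is a hypothetical helper.

```python
import math

def dirichlet_from_logits(logits):
    """Map logits to Dirichlet concentrations alpha_k = exp(z_k) (assumed mapping)."""
    alphas = [math.exp(z) for z in logits]
    strength = sum(alphas)                  # total concentration: confidence magnitude
    mean = [a / strength for a in alphas]   # expected class probabilities: relative structure
    return alphas, strength, mean

logits = [2.0, 1.0, 0.0]
_, s_low, mean_low = dirichlet_from_logits(logits)
# Shifting every logit by the same constant leaves the mean (the softmax) unchanged
# but scales the total concentration, i.e. it changes only how confident the model is.
_, s_high, mean_high = dirichlet_from_logits([z + 3.0 for z in logits])
```

A softmax alone cannot distinguish the two cases above, whereas the concentration parameters can, which is the kind of extra signal a calibration-aware objective could align.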

2512.10521 2026-06-08 cs.CV version update

Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano

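Affiliations: University of Bari

AI summary: Proposes TaP, which fine-tunes the encoder with Low-Rank Adaptation (LoRA) for efficient adaptation in few-shot and cross-domain few-shot semantic segmentation, improving segmentation of novel classes.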

Abstract

Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS by inducing a lightweight feature-space shift conditioned on the support set. TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, thereby ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
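
For readers unfamiliar with LoRA, the low-rank update itself can be sketched in a few lines. This is a generic illustration of the technique with toy matrices, not the TaP code (see the repository above for that); the helper names and all numbers are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; W stays frozen, only A and B train."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

d_out, d_in, rank = 4, 3, 1
w = [[float(i == j) for j in range(d_in)] for i in range(d_out)]  # frozen 4x3 weight
a = [[0.1, 0.2, 0.3]]                      # A: rank x d_in
b_zero = [[0.0] for _ in range(d_out)]     # B: d_out x rank, zero at initialization
b_trained = [[1.0], [0.0], [0.0], [0.0]]   # hypothetical B after adapting on a support set

w_init = lora_weight(w, a, b_zero, alpha=2.0, rank=rank)        # identical to w
w_adapted = lora_weight(w, a, b_trained, alpha=2.0, rank=rank)  # shifted by a rank-1 term
full_params = d_out * d_in            # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters touched by LoRA
```

Because B starts at zero, the adapted encoder initially reproduces the pretrained one, and the number of trainable parameters grows with the rank rather than with the full weight size.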

2512.09634 2026-06-08 cs.CL version update

Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

Karl Gustav Gailit, Kadri Muischnek, Kairit Sirts

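Affiliations: University of Tartu

AI summary: Creates an Estonian document-level subjectivity dataset annotated on a continuous scale, analyzes annotator agreement, and runs an initial experiment on automatic subjectivity analysis with a large language model, finding that automatic scoring is feasible but cannot fully replace human annotation.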

Comments 9 pages, 5 figures, 3 appendixes, LREC 2026

DOI: 10.63317/35rspcvi32vp
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) 8204-8216

Abstract

This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment on automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, which improved the inter-annotator correlation. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to those of the human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.
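
As a rough illustration of the agreement analysis described above, the sketch below computes a correlation between two annotators and ranks documents by score gap to pick re-annotation candidates. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption, and all scores are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 0-100 subjectivity scores from two annotators for five documents.
annotator_a = [10, 35, 60, 80, 95]
annotator_b = [20, 30, 75, 75, 90]
r = pearson(annotator_a, annotator_b)

# Documents with the largest score gap come first as re-annotation candidates.
by_gap = sorted(range(len(annotator_a)),
                key=lambda i: abs(annotator_a[i] - annotator_b[i]),
                reverse=True)
```

The same function could be applied to model-generated scores against a human annotator to compare automatic and manual ratings.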

2511.06080 2026-06-08 cs.CV cs.CY cs.HC version update

AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired

Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla

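Affiliations: Institute for Computer Research, University of Alicante; ETH Zurich

AI summary: Proposes AIDEN, a system combining YOLO real-time object detection, LLaVA scene description and OCR, and continuous haptic guidance based on a Geiger-counter metaphor, which avoids auditory overload and protects privacy; experiments indicate high user satisfaction.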

Abstract

This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" feature achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.

2512.01362 2026-06-08 cs.LG version update

Directed evolution algorithm drives neural prediction

Yanlin Wang, Nancy M Young, Patrick C M Wong

Affiliations: Brain and Mind Institute, The Chinese University of Hong Kong; Department of Linguistics and Modern Languages, The Chinese University of Hong Kong; Division of Otolaryngology, Ann & Robert H. Lurie Children's Hospital of Chicago; Department of Otolaryngology Head & Neck Surgery, Feinberg School of Medicine, Northwestern University; Knowles Hearing Center, Department of Communication Sciences and Disorders, Northwestern University

AI summary: Proposes the directed evolution model (DEM), which mimics the trial-and-error process of biological directed evolution and incorporates a replay buffer and continual backpropagation, improving generalization in cross-domain neural prediction and addressing label scarcity.

Comments 43 pages, 5 figures

Abstract

Neural prediction offers a promising approach to forecasting the individual variability of neurocognitive functions and disorders and providing prognostic indicators for personalized intervention. However, it is challenging to translate neural predictive models into medical artificial intelligence applications due to the limitations of domain shift and label scarcity. Here, we propose the directed evolution model (DEM), a novel computational model that mimics the trial-and-error processes of biological directed evolution to approximate optimal solutions for predictive modeling tasks. We demonstrated that the directed evolution algorithm is an effective strategy for uncertainty exploration, enhancing generalization in reinforcement learning. Furthermore, by incorporating replay buffer and continual backpropagation methods into DEM, we provide evidence of achieving a better trade-off between exploitation and exploration in continuous learning settings. We conducted experiments on four different datasets for children with cochlear implants whose spoken language developmental outcomes vary considerably at the individual-child level. Preoperative neural MRI data have been shown to accurately predict the post-operative outcomes of these children within but not across datasets. Our results show that DEM can efficiently improve the performance of cross-domain pre-implantation neural predictions while addressing the challenge of label scarcity in the target domain.

2510.03381 2026-06-08 cs.LG cs.AI version update

Proxy Reconstruction Pre-training for Ramp Flow Prediction at Highway Interchanges

Yongchao Li, Jun Chen, Zhuoxuan Li, Chao Gao, Yang Li, Chu Zhang, Changyin Dong

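Affiliations: Southeast University; Institute of Telecommunications and Information Sciences, China

AI summary: Proposes the Spatio-Temporal Decoupled Autoencoder (STDAE), which recovers ramp flows from mainline data through cross-modal reconstruction pretraining and is combined with models such as GWNet to improve prediction accuracy, outperforming 13 baselines on real-world datasets.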

Comments Accepted at Applied Soft Computing Journal

DOI: 10.1016/j.asoc.2026.115462
Journal ref: Applied Soft Computing Journal 200 (2026) 115462

Abstract

Interchanges are crucial nodes for vehicle transfers between highways, yet the lack of real-time ramp detectors creates blind spots in traffic prediction. To address this, we propose a Spatio-Temporal Decoupled Autoencoder (STDAE), a two-stage framework that leverages cross-modal reconstruction pretraining. In the first stage, STDAE reconstructs historical ramp flows from mainline data, forcing the model to capture intrinsic spatio-temporal relations. Its decoupled architecture with parallel spatial and temporal autoencoders efficiently extracts heterogeneous features. In the prediction stage, the learned representations are integrated with models such as GWNet to enhance accuracy. Experiments on three real-world interchange datasets show that STDAE-GWNET consistently outperforms thirteen state-of-the-art baselines and achieves performance comparable to models using historical ramp data. This demonstrates its effectiveness in overcoming detector scarcity and its plug-and-play potential for diverse forecasting pipelines.

2511.19359 2026-06-08 cs.LG version update

Enhancing Conformal Prediction via Class Similarity

Ariel Fargion, Lahav Dabah, Tom Tirer

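Affiliations: Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel

AI summary: Proposes improving conformal prediction with class similarity, by penalizing out-of-group errors or by using embedding information, which reduces prediction set size and improves semantic consistency.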

Comments ICML 2026 (camera-ready). Code is available at: https://github.com/ariel361/CP_via_CS

Abstract

Conformal Prediction (CP) has emerged as a powerful statistical framework for high-stakes classification applications. Instead of predicting a single class, CP generates a prediction set, guaranteed to include the true label with a pre-specified probability. The performance of different CP methods is typically assessed by their average prediction set size. In setups where the classes can be partitioned into semantic groups, e.g., diseases that require similar treatment, users can benefit from prediction sets that are not only small on average, but also contain a small number of semantically different groups. This paper begins by addressing this problem and ultimately offers a widely applicable tool for boosting any CP method on any dataset. First, given a class partition, we propose augmenting the CP score function with a term that penalizes predictions with out-of-group errors. We theoretically analyze this strategy and prove its advantages for group-related metrics. Surprisingly, we show mathematically that, for common class partitions, it can also reduce the average set size of any CP score function. Our analysis reveals the class-similarity factors behind this improvement and motivates a variant that can further reduce prediction set size by leveraging the model's embeddings, without requiring any human semantic partition. Finally, we present an extensive empirical study, encompassing prominent CP methods, multiple models, and several datasets, which demonstrates that our class-similarity-based approach consistently enhances CP methods.
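
The idea of adding a group penalty to a CP score can be sketched with split conformal prediction. This is not the authors' score function (their code is linked in the Comments line); the penalty form used here, a constant `lam` added when a candidate label lies outside the group of the model's top class, is an assumption, and the calibration data are toy values.

```python
import math

def score(probs, label, groups, lam):
    """Nonconformity: 1 - p(label), plus `lam` if label is outside the top class's group."""
    top = max(range(len(probs)), key=lambda k: probs[k])
    penalty = lam if groups[label] != groups[top] else 0.0
    return 1.0 - probs[label] + penalty

def calibrate(cal_probs, cal_labels, groups, lam, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = sorted(score(p, y, groups, lam) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else scores[k - 1]

def prediction_set(probs, groups, lam, qhat):
    """All labels whose score does not exceed the calibrated threshold."""
    return [k for k in range(len(probs)) if score(probs, k, groups, lam) <= qhat]

groups = [0, 0, 1, 1]  # hypothetical partition of four classes into two semantic groups
cal_probs = [[0.7, 0.2, 0.05, 0.05]] * 5 + [[0.4, 0.35, 0.15, 0.10]] * 4
cal_labels = [0, 0, 0, 0, 1, 0, 1, 1, 0]
test_probs = [0.38, 0.02, 0.36, 0.24]

q_plain = calibrate(cal_probs, cal_labels, groups, lam=0.0, alpha=0.25)
q_group = calibrate(cal_probs, cal_labels, groups, lam=0.5, alpha=0.25)
set_plain = prediction_set(test_probs, groups, 0.0, q_plain)
set_group = prediction_set(test_probs, groups, 0.5, q_group)
```

In this toy case the penalized score drops the out-of-group label from the prediction set. The same score is used for calibration and prediction, so the usual split-conformal coverage argument still applies; the toy calibration labels were chosen so that the threshold is unaffected by the penalty, which will not hold in general.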

2511.12795 2026-06-08 cs.RO version update

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

Boshu Lei, Wen Jiang, Kostas Daniilidis

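Affiliations: University of Pennsylvania; Archimedes, Athena RC

AI summary: For grasping in densely cluttered environments, proposes a calibrated energy-based model that generates grasp poses and actively selects views by the information gain of the grasp distribution, grasping target objects efficiently under a limited view budget.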

Comments CVPR 2026

Abstract

Grasping in a densely cluttered environment is a challenging task for robots. Previous methods tried to solve this problem by actively gathering multiple views before grasp pose generation. However, they either overlooked the importance of the grasp distribution for information gain estimation or relied on the projection of the grasp distribution, which ignores the structure of grasp poses on the SE(3) manifold. To tackle these challenges, we propose a calibrated energy-based model for grasp pose generation and an active view selection method that estimates information gain from the grasp distribution. Our energy-based model captures the multimodal nature of the grasp distribution on the SE(3) manifold. The energy level is calibrated to the success rate of grasps so that the predicted distribution aligns with the real distribution. The next best view is selected by estimating the information gain for grasping from the calibrated distribution conditioned on the reconstructed environment, which can efficiently drive the robot to explore affordable parts of the target object. Experiments on simulated environments and real robot setups demonstrate that our model can successfully grasp objects in a cluttered environment with limited view budgets compared to previous state-of-the-art models. Our simulated environment can serve as a reproducible platform for future research on active grasping. The source code of our paper will be made public when the paper is released to the public.

2511.07380 2026-06-08 cs.CL version update

Mining Useful General Data for Low-Resource Domain Adaptation

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

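Affiliations: arXiv

AI summary: To address data scarcity in low-resource domains, proposes NTK-Selector, which uses the Neural Tangent Kernel to select useful samples from general-domain data and substantially improves domain adaptation.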

Comments 39 pages

Abstract

Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel's ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.

2511.05949 2026-06-08 cs.CV version update

Zero-Shot Polygon Matching with Pre-trained Models for Pose Estimation and Polygon Cloud from Challenging Stereo

Chang Li, Xingtao Peng

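Affiliations: Chang Li; Xingtao Peng

AI summary: Proposes Z(PM)2, the first zero-shot polygon matching paradigm, which combines pre-trained models with handcrafted geometric constraints and uses bidirectional-pyramid matching and local-holistic bipartite graph optimization to handle disparity discontinuity, scale variation, and related problems, achieving leading performance in pose estimation and 3D representation.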

Abstract

While stereo matching has achieved maturity for 0D point and 1D line primitives, establishing correspondences for 2D polygons remains largely unexplored due to challenges including disparity discontinuity, scale variation, training dependency, and poor generalization, limiting downstream tasks such as pose estimation and 3D reconstruction. To address these issues, we are the first to propose a Zero-shot Polygon Matching paradigm with Pre-trained Models (i.e., Z(PM)2), which combines learned features and handcrafted geometric constraints through plug-and-play modules, extending matching from 0D/1D primitives to 2D polygons. The pipeline comprises three core stages. First, the detector leverages the pre-trained Segment Anything Model to vectorize segmentation masks into graph-structured polygons integrating geometry and texture. Second, the global matcher uses bidirectional-pyramid and multi-geometric constraints to handle viewpoint variation. Third, the local matcher leverages local-holistic bipartite graph optimization to resolve disparity discontinuity and topological inconsistency. Moreover, we develop polygon-matching-guided pose estimation using correspondences to obtain well-distributed, low-redundancy homologous points, and pioneer the polygon cloud concept with an optimal surface generation method, producing structurally complete and semantically rich 3D representations beyond point and line clouds. Since no polygon matching methods from stereo imagery are available for direct comparison, we selected state-of-the-art (SoTA) methods close to this task as baselines. Extensive experiments on five challenging datasets (ISPRS, KITTI, ScanNet, SceneFlow, DTU) show that Z(PM)2 achieves a 68.60% matching area score, outperforming MESA by approximately 32% and ranking first in area-level pose estimation, with competitive speed and strong zero-shot generalization without any training requirement.

2510.09041 2026-06-08 cs.LG cs.AI version update

Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach

Junchao Fan, Qi Wei, Ruichen Zhang, Yang Lu, Jianhua Wang, Xiaolin Chang, Bo Ai

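Affiliations: Beijing Key Laboratory of Security and Privacy in Intelligent Transportation; Beijing Jiaotong University; College of Computing and Data Science; Nanyang Technological University; School of Computer Science and Technology; Taiyuan University of Technology; School of Electronics and Information Engineering

AI summary: To address the vulnerability of deep reinforcement learning to adversarial attacks in autonomous driving, proposes Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), which trains a robust driving agent through interaction with a strategic targeted adversary and improves policy stability under constrained optimization; experiments show a success rate at least 27.9% higher than existing methods.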

Abstract

Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.

2510.26615 2026-06-08 cs.CL version update

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

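Affiliations: Georgia Institute of Technology; J.P. Morgan AI Research

AI summary: Proposes SlideAgent, a hierarchical agentic framework for understanding multimodal, multi-page documents such as slide decks, which builds structured representations through reasoning at the global, page, and element levels and improves accuracy by 7.9% and 9.8% over proprietary and open-source models, respectively.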

Comments ACL 2026 Main Conference. https://slideagent.github.io/

Abstract

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels--global, page, and element--to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent significantly improves accuracy over both proprietary (+7.9%) and open-source models (+9.8%).

2508.17821 2026-06-08 cs.LG cs.AI cs.CL version update

Limitations of Normalization in Attention Mechanism

Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

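Affiliations: University of Luxembourg; London Institute for Mathematical Sciences

AI summary: Using a theoretical framework and GPT-2 experiments, shows that softmax normalization drives attention toward uniformity as the number of selected tokens grows, and analyzes the training challenges caused by gradient sensitivity at low temperatures.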

Abstract

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
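
The dilution effect described above can be reproduced with a toy calculation. The sketch assumes one token leads all others by a fixed logit gap, which is a simplification chosen for illustration and not the paper's setup; `max_weight` and the gap value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_weight(n, gap=2.0, temperature=1.0):
    """Largest attention weight when one token leads the other n-1 by `gap` in logit."""
    return softmax([gap] + [0.0] * (n - 1), temperature)[0]

# With a bounded logit gap, the leading token's weight shrinks roughly like 1/n.
weights = {n: max_weight(n) for n in (4, 64, 1024)}
# Lowering the temperature restores a sharp selection for the same 64 tokens.
sharp = max_weight(64, temperature=0.1)
```

The leading weight equals e^gap / (e^gap + n - 1), so for a fixed gap it tends to zero as the context grows, while a low temperature sharpens the distribution at the cost of the gradient sensitivity the abstract mentions.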

2510.16023 2026-06-08 cs.LG cond-mat.mtrl-sci (updated version)

A Conformation-Centric Generative Foundation Model for Linear Polymer Modeling and Design

Fanmeng Wang, Ruochao Wang, Shan Mei, Wentao Guo, Hongshuai Wang, Qi Ou, Zhifeng Gao, Hongteng Xu

Affiliations: Gaoling School of Artificial Intelligence; Renmin University of China; DP Technology; SINOPEC Research Institute of Petroleum Processing Co., Ltd.

AI summary: Proposes PolyConFM, a foundation model for linear polymers built on conformation-centric generative pretraining (conditional generation, masked autoregressive modeling, and orientation transformations); it outperforms existing methods across diverse downstream tasks.

Abstract

Linear polymers, macromolecules formed from monomers covalently bonded into continuous chains, underpin countless technologies and are indispensable to modern life. While deep learning is advancing polymer science, existing methods typically represent the whole linear polymer solely through monomer-level descriptors, overlooking the global structural information inherent in polymer conformations, which ultimately limits their practical performance. Moreover, this important field still lacks a dedicated foundation model that can effectively support diverse downstream tasks, thereby severely constraining progress. To address these challenges, we introduce PolyConFM, a foundation model tailored for modeling and designing linear polymers through conformation-centric generative pretraining. Recognizing that each linear polymer is essentially a continuous chain whose conformation can be naturally decomposed into a sequence of local conformations (i.e., those of its repeating units), we pretrain PolyConFM under the conditional generation paradigm, reconstructing these local conformations via masked autoregressive (MAR) modeling and further generating their orientation transformations to recover the corresponding polymer conformation. Meanwhile, we construct a linear polymer conformation dataset via molecular dynamics simulations to mitigate data sparsity, thereby enabling conformation-centric pretraining. Experiments demonstrate that PolyConFM consistently outperforms representative task-specific methods across diverse downstream tasks, thereby equipping polymer science with a powerful tool targeting linear polymers.

2510.01427 2026-06-08 cs.AI (updated version)

Small Language Model Agents Enable Efficient and High-Quality Knowledge Mining

Sipeng Zhang, Shuhuai Lin, Xinpeng Wei, Yihang Chen, Pin Qian, Su Wang, Huan Xu

Affiliations: University of California, San Diego; Carnegie Mellon University; Georgia Institute of Technology

AI summary: Proposes Falconer, a framework that pairs the agentic reasoning of LLMs with lightweight proxy models, using planning and annotation to scale knowledge mining; it preserves instruction-following accuracy while cutting inference cost by up to 90% and running more than 20x faster.

Comments Code available: https://github.com/LongfeiYun17/falconer

Abstract

At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response to user instructions. Large language models (LLMs) excel at interpreting such instructions but are prohibitively expensive to deploy at scale, while traditional pipelines of classifiers and extractors remain efficient yet brittle and unable to generalize to new tasks. We introduce Falconer, a collaborative framework that combines the agentic reasoning of LLMs with lightweight proxy models for scalable knowledge mining. In Falconer, LLMs act as planners, decomposing user instructions into executable pipelines, and as annotators, generating supervision to train small proxies. The framework unifies classification and extraction into two atomic operations, get label and get span, enabling a single instruction-following model to replace multiple task-specific components. To evaluate the consistency between proxy models incubated by Falconer and annotations provided by humans and large models, we construct new benchmarks covering both planning and end-to-end execution. Experiments show that Falconer closely matches state-of-the-art LLMs in instruction-following accuracy while reducing inference cost by up to 90% and accelerating large-scale knowledge mining by more than 20x, offering an efficient and scalable foundation for Deep Research.

2510.11014 2026-06-08 cs.RO cs.AI cs.CV (updated version)

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

Affiliations: School of Computing, Australian National University

AI summary: Addresses the scene structure hidden from a robot that views a room through a doorway. A pipeline of pretrained models (VLM-guided outpainting, monocular depth estimation, semantic segmentation) samples semantically labeled 3D point-cloud priors of the hidden room, and the MatterDoor benchmark, derived from Matterport3D, is used to validate these zero-shot spatio-semantic priors.

Comments Under Review

Abstract

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.
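
The step from a depth map plus a segmentation map to a semantically labeled point cloud can be sketched with standard pinhole back-projection. This is a minimal illustration, not the authors' pipeline: the function name, the nested-list image format, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for the example.

```python
def backproject(depth, labels, fx, fy, cx, cy):
    """Turn a depth map and a per-pixel label map into labeled 3D points.

    depth[v][u] is the metric depth at pixel (u, v); non-positive values
    are treated as missing. Uses the pinhole camera model.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth estimate
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, labels[v][u]))
    return points


# Hypothetical 2x2 image with one invalid depth value.
cloud = backproject(
    depth=[[2.0, 0.0], [1.0, 4.0]],
    labels=[["wall", "wall"], ["floor", "chair"]],
    fx=2.0, fy=2.0, cx=0.5, cy=0.5,
)
```

Running this on several outpainted hypotheses of the same view would give a set of point clouds, which is one plausible way to read "sampling" a prior over the hidden room.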

2411.09734 2026-06-08 cs.LG cs.NA math.NA math.OC (updated version)

Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations

Carlos Heredia

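Affiliations: IAMM Research, Department of Applied Artificial Intelligence; DAMM

AI summary: Proposes continuous-time integro-differential equation models of AdaGrad, RMSProp, and Adam, and checks their agreement with the discrete algorithms through numerical simulation and stability analysis, offering a new perspective on adaptive optimization methods.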

Comments 60 pages, 15 figures; v3 - Section 4 corrected

2510.07315 2026-06-08 cs.CL cs.AI cs.LG cs.SE (updated version)

SWE-IF: Aligning Code Evaluation with Human Preference

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

Affiliations: Google DeepMind

AI summary: Proposes the SWE-IF benchmark, which uses VeriCode, a taxonomy of verifiable instructions, to evaluate code instruction following; finds that instruction following is the key differentiator of LLM code quality and that combining it with functional correctness best matches human preference.

Comments ICML 2026

Abstract

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
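
To make the two signals in this abstract concrete, the sketch below computes the standard unbiased pass@k estimate alongside an instruction-following rate from verifier outcomes. The `composite_score` function and its equal weighting are assumptions for illustration; the abstract does not say how the paper combines the two scores.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def instruction_following_rate(verifier_results):
    """Fraction of instructions whose deterministic verifier returned True."""
    return sum(verifier_results) / len(verifier_results)


def composite_score(functional, instruction, weight=0.5):
    """Hypothetical convex combination of the two scores (assumed weighting)."""
    return weight * functional + (1.0 - weight) * instruction


functional = pass_at_k(n=10, c=5, k=1)
instruction = instruction_following_rate([True, False, True, True])
score = composite_score(functional, instruction)
```

A sample that passes every unit test can still score low here if it ignores the stated instructions, which is the gap the abstract says pass@k alone does not capture.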

2510.05363 2026-06-08 cs.AI (updated version)

MHA-RAG: Improving Efficiency, Accuracy, and Consistency by Encoding Exemplars as Soft Prompts

Abhinav Jain, Xinyu Yao, Thomas Reps, Christopher Jermaine

Affiliations: Department of Computer Science, Rice University; Department of Computer Science, University of Wisconsin–Madison

AI summary: Proposes MHA-RAG, a framework that encodes domain exemplars as soft prompts and controls their generation through multi-head attention; across multiple question-answering benchmarks it gains 20 points over standard RAG while cutting inference cost by 10x.

Comments 17 pages, 5 figures

Abstract

Adapting Foundation Models to new domains with limited training data is challenging and computationally expensive. While prior work has demonstrated the effectiveness of using domain-specific exemplars as in-context demonstrations, we investigate whether representing exemplars purely as text is the most efficient, effective, and stable approach. We explore an alternative: representing exemplars as soft prompts with an exemplar order invariant model architecture. To this end, we introduce Multi-Head Attention Retrieval-Augmented Generation (MHA-RAG), a framework with the number of attention heads serving as a simple hyperparameter to control soft prompt-generation across different tasks. Across multiple question-answering benchmarks and model scales, MHA-RAG achieves a 20-point performance gain over standard RAG, while cutting inference costs by a factor of 10x in GFLOPs, delivering both higher accuracy and greater efficiency, invariant to exemplar order.
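
The order-invariance claim follows from how attention pools a set: the output is a weighted sum over exemplars, so permuting them leaves it unchanged. The sketch below illustrates only that property. It is not the MHA-RAG architecture: learned projections are omitted, heads are formed by slicing the embedding, and `soft_prompt` is a hypothetical name.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a set of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def soft_prompt(query, exemplars, num_heads):
    """Return one pooled vector per head; each acts as a soft-prompt token."""
    if len(query) % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = len(query) // num_heads
    tokens = []
    for h in range(num_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        sliced = [e[part] for e in exemplars]
        tokens.append(attend(query[part], sliced, sliced))
    return tokens


query = [0.5, -1.0, 0.25, 2.0]
exemplars = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, -0.5, 1.0], [0.3, 0.3, 0.3, 0.3]]
original = soft_prompt(query, exemplars, num_heads=2)
shuffled = soft_prompt(query, exemplars[::-1], num_heads=2)
```

The prompt length depends on `num_heads` rather than on the number of exemplars, which is consistent with the abstract's description of the head count as the controlling hyperparameter.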

2509.25522 2026-06-08 cs.AI (updated version)

Understanding Generative Recommendation with Semantic IDs from a Model-scaling View

Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju

Affiliations: Michigan State University; Snap Inc.

AI summary: Shows that SID-based generative recommendation hits a performance bottleneck as models scale, and finds that using LLMs directly as recommenders scales better, with gains of up to 20%.

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3817976

Abstract

Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
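
A semantic ID is a short tuple of discrete codes derived from an item embedding. The sketch below uses residual quantization, a common way to produce such codes; the abstract only says the codes are "quantized from the embeddings", so the choice of residual quantization, the function names, and the tiny codebooks are all assumptions.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` in squared distance."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def semantic_id(embedding, codebooks):
    """Quantize an embedding level by level, encoding the residual each time."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tuple(codes)


# Hypothetical two-level codebooks over 2-D embeddings.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.2, -0.2]],
]
item_a = semantic_id([1.2, 0.8], codebooks)
item_b = semantic_id([0.05, 0.0], codebooks)
```

Because each item is reduced to a few small integers, distinct items can collide or lose detail, which is one way to picture the limited semantic capacity that the abstract identifies as a bottleneck.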

2509.14380 2026-06-08 cs.RO (updated version)

CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks

Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr

Affiliations: Department of Mechanical Engineering, University of California Berkeley

AI summary: Proposes CRAFT, a framework that uses LLMs to decompose tasks and generate reward functions, refines the rewards with a vision-language model, and learns multi-robot coordination; validated on multi-quadruped navigation and bimanual manipulation tasks.

Abstract

Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to their high-dimensional continuous joint action spaces, complex reward design, and non-stationarity from concurrently learning agents. On the other hand, humans often learn complex coordination with the help of coaches, who guide learning through carefully designed curricula and detailed feedback. Building on the reasoning capabilities of foundation models, we argue that these models can similarly coach robots to learn coordination. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for learning coordination Tasks, a framework that leverages foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). Then, CRAFT trains each subtask using LLM-generated reward functions, and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, and demonstrate its capability to learn complex coordination behaviors. In addition, in a multi-quadruped navigation setting, we show that our learned policies transfer to the real world. Project website is https://iconlab.negarmehr.com/CRAFT/

2509.24935 2026-06-08 cs.CV cs.AI cs.LG (updated version)

Scalable GANs with Transformers

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Affiliations: KAIST

AI summary: Studies GAN scalability using a compact VAE latent space and a purely transformer-based architecture, and proposes lightweight intermediate supervision and width-aware learning-rate adjustment to address failure modes that appear when scaling; reaches an FID of 2.96 on ImageNet-256 in 40 epochs.

Comments ICML 2026

Abstract

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines. Project page: https://hse1032.github.io/GAT.

2504.03635 2026-06-08 cs.AI cs.CL (updated version)

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

Affiliations: University of California, Berkeley; University of Cambridge; University of Washington; University of Toronto; University of Tokyo

AI summary: Through pretraining experiments in a controlled synthetic environment, finds a scaling law relating the minimal parameter budget for implicit reasoning to graph search entropy, and identifies a capacity ceiling of about 0.008 bits of information per parameter.

Comments Accepted to ICML 2026

Abstract

Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chain-of-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference. From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure. Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over approximately 0.008 bits of information per parameter at most. Our results characterize the minimal sufficient capacity for implicit reasoning during pretraining. Our findings provide principled guidance for matching model size to data complexity and offer new insights into the scaling behavior of reasoning in large language models.
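
The reported ceiling of roughly 0.008 bits per parameter lends itself to a back-of-the-envelope calculation. The sketch below assumes the relationship is a simple linear ratio between graph search entropy and parameter count, which is a simplification; the function names and the example entropy value are hypothetical.

```python
import math

BITS_PER_PARAMETER = 0.008  # approximate ceiling quoted in the abstract


def min_parameters(search_entropy_bits, bits_per_parameter=BITS_PER_PARAMETER):
    """Smallest parameter count implied by the ceiling, under a linear reading."""
    return search_entropy_bits / bits_per_parameter


def max_bits(parameters, bits_per_parameter=BITS_PER_PARAMETER):
    """Largest graph search entropy a model of this size could cover."""
    return parameters * bits_per_parameter


# A hypothetical knowledge graph with one million bits of search entropy.
needed = min_parameters(1_000_000)
```

Under this reading, the hypothetical one-million-bit graph would call for a model on the order of 125 million parameters, and a one-billion-parameter model would cover about eight million bits.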

2503.03660 2026-06-08 cs.LG (updated version)

Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns

Dong Tian, Onur Celik, Gerhard Neumann

Affiliations: Karlsruhe Institute of Technology (KIT)

AI summary: Proposes a sequence-conditioned critic that models trajectory context with a lightweight Transformer and trains on aggregated N-step targets, strengthening the critic's temporal modeling for long-horizon and sparse-reward problems without importance sampling.

Comments 39 pages, 15 figures, ICLR2026 Poster

Abstract

We introduce a sequence-conditioned critic for Soft Actor-Critic (SAC) that models trajectory context with a lightweight Transformer and trains on aggregated $N$-step targets. Unlike prior approaches that (i) score state-action pairs in isolation or (ii) rely on actor-side action chunking to handle long horizons, our method strengthens the critic itself by conditioning on short trajectory segments and integrating multi-step returns -- without importance sampling (IS). The resulting sequence-aware value estimates capture the critical temporal structure for extended-horizon and sparse-reward problems. On local-motion benchmarks, we further show that freezing critic parameters for several steps makes our update compatible with CrossQ's core idea, enabling stable training without a target network. Despite its simplicity -- a 2-layer Transformer with 128-256 hidden units and a maximum update-to-data ratio (UTD) of $1$ -- the approach consistently outperforms standard SAC and strong off-policy baselines, with particularly large gains on long-trajectory control. These results highlight the value of sequence modeling and $N$-step bootstrapping on the critic side for long-horizon reinforcement learning.
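
An $N$-step target sums discounted rewards along a trajectory segment and then bootstraps from a value estimate at the end. The sketch below shows that computation and one possible aggregation. The plain average in `aggregated_target` is an assumed rule, since the abstract does not specify how targets are aggregated, and SAC's entropy term is left out for brevity.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Discounted N-step target: sum_i gamma^i * r_i + gamma^N * V(s_N)."""
    target = bootstrap_value
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def aggregated_target(rewards, values_after, gamma=0.99):
    """Average of the 1-step through N-step targets along one segment.

    values_after[n - 1] is the value estimate for the state reached after
    n steps. Averaging is an assumption made for this illustration.
    """
    horizon = len(rewards)
    targets = [
        n_step_target(rewards[:n], values_after[n - 1], gamma)
        for n in range(1, horizon + 1)
    ]
    return sum(targets) / horizon


two_step = n_step_target([1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
averaged = aggregated_target([1.0, 1.0], [0.0, 0.0], gamma=0.5)
```

Longer segments lean more on observed rewards and less on the bootstrapped estimate, which is why multi-step targets tend to help when rewards are sparse.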

2509.21751 2026-06-08 cs.LG physics.comp-ph physics.flu-dyn (updated version)

On the Effect of Neural Field Reparameterization for 4DVAR

Jaemin Oh

Affiliations: Division of Applied Mathematics, Brown University

AI summary: Reparameterizes 4DVAR with neural fields, using their spectral bias as an implicit regularizer; needs no background error covariance, enables parallel-in-time optimization, and outperforms the classical method on chaotic benchmarks.

Comments 26 pages, 9 figures, 11 tables

Abstract

Four-dimensional variational data assimilation (4DVAR) is a cornerstone of numerical weather prediction, yet it remains computationally intensive and sensitive to initialization due to the non-convexity of its objective function. We propose a neural field-based reformulation of 4DVAR in which the spatiotemporal state is represented as a continuous function parameterized by a neural network. We demonstrate that optimizing in parameter space leverages the spectral bias of neural fields, acting as an implicit regularizer that stabilizes state estimation and suppresses spurious high-frequency oscillations without requiring explicit background error covariance information. Furthermore, by parameterizing the full spatiotemporal trajectory, our framework enables parallel-in-time optimization and incorporates physical constraints directly through physics-informed losses. Evaluations on chaotic benchmarks, including 2D Kolmogorov flow and 3D Taylor-Green vortices, show that neural reparameterization produces more accurate initial conditions than classical 4DVAR. When combined with separable neural architectures (SPINNs), the method achieves substantial speedups. Unlike many machine learning approaches, this framework requires no ground-truth training data, offering a robust and scalable alternative for operational data assimilation.

2505.21285 2026-06-08 cs.LG stat.ML (updated version)

Learnable Kernel Density Estimation for Graphs and Its Application to Graph-Level Anomaly Detection

Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan

Affiliations: University of Science and Technology of China

AI summary: Proposes LGKDE, a framework that represents graphs as distributions with graph neural networks and learns multi-scale kernel density estimation through maximum mean discrepancy; with theoretical guarantees, it captures structural patterns and semantic variations and outperforms existing methods on graph anomaly detection.

Comments Accepted in the Forty-Third International Conference on Machine Learning (ICML 2026), Main Track

Abstract

This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and generalization. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.
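
The combination of maximum mean discrepancy (MMD) and multi-scale KDE can be sketched without any learning. The example below treats each graph as a small set of node embeddings, which stands in for the GNN output, and scores a graph by its kernel density relative to reference graphs. The fixed Gaussian kernel, the bandwidth values, and the function names are assumptions; LGKDE learns these components rather than fixing them.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of node embeddings."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def kde_score(graph, reference_graphs, bandwidths=(0.5, 1.0, 2.0)):
    """Unnormalized multi-scale density of `graph` under the reference set."""
    total = 0.0
    for h in bandwidths:
        for ref in reference_graphs:
            distance_sq = max(mmd_squared(graph, ref), 0.0)
            total += math.exp(-distance_sq / (2.0 * h ** 2))
    return total / (len(bandwidths) * len(reference_graphs))


# Hypothetical node embeddings: two reference graphs, one typical, one outlier.
references = [[[0.0, 0.1], [0.1, 0.1]], [[0.05, 0.0], [0.0, 0.05]]]
typical = [[0.0, 0.0], [0.1, 0.0]]
outlier = [[3.0, 3.0], [3.1, 2.9]]
```

A low `kde_score` flags a graph as anomalous. In this toy data the outlier's embeddings sit far from every reference graph, so its score is much lower than the typical graph's.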

2509.11740 2026-06-08 cs.RO (updated version)

From Pixels to Shelf: An Integrated Robotic System for Autonomous Supermarket Stocking with a Mobile Manipulator

Davide Peron, Victor Nan Fernandez-Ayala, Lukas Segelmark

Affiliations: Department of Information Engineering, University of Padova; Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology; Animum AB

AI summary: Proposes a modular robotic system integrating commercial hardware with ROS2, using behavior-tree planning, vision-based detection, and two-step MPC; achieves over 98% pick-and-place success across more than 700 stocking events, though performance still trails human workers.

Comments Preprint for CASE 2026

Abstract

Autonomous stocking in retail environments, particularly supermarkets, presents challenges due to dynamic human interactions, constrained spaces, and diverse product geometries. This paper introduces an efficient modular robotic system for autonomous shelf stocking, integrating commercially available hardware with a scalable algorithmic architecture. A major contribution of this work is the system integration of off-the-shelf hardware and ROS2-based perception, planning, and control into a single deployable platform for retail environments. Our solution leverages Behavior Trees (BTs) for task planning, fine-tuned vision models for object detection, and a two-step Model Predictive Control (MPC) framework for precise shelf navigation using ArUco markers. Laboratory experiments replicating realistic supermarket conditions demonstrate reliable performance, achieving over 98% success in pick-and-place operations across a total of more than 700 stocking events. However, our comparative benchmarks indicate that the performance and cost-effectiveness of current autonomous systems remain inferior to that of human workers, which we use to highlight key improvement areas and quantify the progress still required before widespread commercial deployment can realistically be achieved.
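
Behavior Trees structure a task as composable nodes that each report success or failure when ticked. The sketch below is a minimal illustration of that idea, not the system described here: the node classes are simplified (no RUNNING status), and the leaf names such as `detect_item` and `rescan_shelf` are hypothetical.

```python
SUCCESS, FAILURE = "SUCCESS", "FAILURE"


class Action:
    """Leaf node that runs a callable against the shared state."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def tick(self, state):
        state.setdefault("trace", []).append(self.name)
        return SUCCESS if self.fn(state) else FAILURE


class Sequence:
    """Succeeds only if every child succeeds, ticking them in order."""

    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            if child.tick(state) == FAILURE:
                return FAILURE
        return SUCCESS


class Fallback:
    """Succeeds as soon as one child succeeds, trying them in order."""

    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            if child.tick(state) == SUCCESS:
                return SUCCESS
        return FAILURE


def build_stocking_tree():
    """Hypothetical stocking task: find the item (rescanning if needed), pick, place."""
    detect = Action("detect_item", lambda s: s.get("item_visible", False))
    rescan = Action("rescan_shelf", lambda s: s.update(item_visible=True) is None)
    pick = Action("pick_item", lambda s: s["item_visible"])
    place = Action("place_on_shelf", lambda s: True)
    return Sequence(Fallback(detect, rescan), pick, place)


state = {}
result = build_stocking_tree().tick(state)
```

The fallback node gives the task a recovery path without extra branching logic in the leaves, which is a common reason to choose Behavior Trees over a hand-written state machine.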

2509.05316 2026-06-08 cs.LG cs.AI (updated version)

Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning

Praveen Bushipaka, Lucia Passaro, Tommaso Cucinotta

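Affiliations: Scuola Superiore Sant’Anna; University of Pisa

AI summary: To address shortcomings of sampling strategies in LLM unlearning, proposes Modular Entity-Level Unlearning (MELU), which balances forgetting effectiveness and model utility through diversified neighbor sets and modular sampling.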

2508.19239 2026-06-08 cs.AI (updated version)

Model Context Protocols in Adaptive Transport Systems: A Survey

Gaurab Chhetri, Shriyank Somvanshi, Md Monzurul Islam, Shamyo Brotee, Mahmuda Sultana Mimi, Dipti Koirala, Biplov Pandey, Subasish Das

Affiliations: Texas State University San Marcos

AI summary: Presents the first systematic survey of the Model Context Protocol (MCP) as a unifying paradigm, proposes a five-category taxonomy, exposes the limits of isolated adaptation in traditional protocols, and notes that MCP's client-server and JSON-RPC structure enables semantic interoperability, laying groundwork for next-generation adaptive intelligent transport infrastructure.

Abstract

The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transport systems, where diverse protocols and context sources remain isolated. This survey provides the first systematic investigation of the Model Context Protocol (MCP) as a unifying paradigm, highlighting its ability to bridge protocol-level adaptation with context-aware decision making. Analyzing established literature, we show that existing efforts have implicitly converged toward MCP-like architectures, signaling a natural evolution from fragmented solutions to standardized integration frameworks. We propose a five-category taxonomy covering adaptive mechanisms, context-aware frameworks, unification models, integration strategies, and MCP-enabled architectures. Our findings reveal three key insights: traditional transport protocols have reached the limits of isolated adaptation, MCP's client-server and JSON-RPC structure enables semantic interoperability, and AI-driven transport demands integration paradigms uniquely suited to MCP. Finally, we present a research roadmap positioning MCP as a foundation for next-generation adaptive, context-aware, and intelligent transport infrastructures.
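
The JSON-RPC structure mentioned in this abstract is easy to show concretely. The sketch below builds a JSON-RPC 2.0 request and response of the shape MCP uses for tool calls. The tool name `get_traffic_state`, its arguments, and the result payload are hypothetical examples, not taken from the survey.

```python
import json


def make_request(request_id, method, params):
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def make_result(request_id, result):
    """Build the matching JSON-RPC 2.0 success response."""
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


# A client asking a server to run a (hypothetical) transport tool.
request = make_request(
    1,
    "tools/call",
    {"name": "get_traffic_state", "arguments": {"intersection_id": "A12"}},
)
wire = json.dumps(request)

# The server parses the message and answers with the same id.
received = json.loads(wire)
response = make_result(
    received["id"],
    {"content": [{"type": "text", "text": "queue_length=7"}]},
)
```

Because every message carries a method name and structured parameters, a client can call tools exposed by different servers through one uniform message format, which is the interoperability benefit the survey attributes to this structure.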
