arXivDaily: daily arXiv research digest, updated Monday through Friday
2605.20689 2026-05-21 cs.CL cs.AI cs.IR cs.LG

DIVE: Embedding Compression via Self-Limiting Gradient Updates


Dongfang Zhao

Affiliations: University of Washington Tacoma, School of Engineering and Technology

AI summary: This paper proposes DIVE, which combines a self-limiting triplet loss with a head-wise NT-Xent contrastive loss to counter the overfitting that scarce labeled data causes in embedding compression, improving retrieval performance.

Abstract

High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose DIVE (Dimensionality reduction with Implicit View Ensembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, DIVE outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.
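
The self-limiting behaviour of a hinge-based triplet loss can be illustrated with a minimal sketch. It is not the DIVE implementation: the Euclidean distance, the margin of 0.2, and the function names are assumptions chosen for illustration, since the abstract does not specify them. The point it shows is that once the negative is farther from the anchor than the positive by at least the margin, the loss is exactly zero, so that triplet contributes no gradient.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge_triplet_loss(anchor, positive, negative, margin=0.2):
    """Return (loss, active) for one triplet.

    `active` is True only while the margin constraint is violated; a
    satisfied triplet has zero loss and therefore zero gradient.
    """
    violation = margin + euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, violation), violation > 0.0


# Constraint satisfied: the negative is far away, so the triplet goes silent.
satisfied = hinge_triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])

# Constraint violated: the negative is only slightly farther than the positive.
violated = hinge_triplet_loss([0.0, 0.0], [0.5, 0.0], [0.6, 0.0])
```

In a training loop, only triplets flagged `active` would move the adapter weights, which is one way to read the abstract's claim that the total perturbation of the pretrained space is bounded.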

2605.20684 2026-05-21 cs.CL

Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting


Linus Ng Junjia, Ezekiel Tee Kongquan, Kelvin Heng, Kenneth Zhu Ke, Zhao Jing Yuan

Affiliations: Georgia Institute of Technology

AI summary: This paper proposes a two-phase non-parametric retrieval architecture that addresses the gap between retrieved results and decision utility in credit underwriting. It combines lexical and dense multilingual retrieval to build a candidate pool, then uses an LLM-as-a-Judge mechanism to score documents by utility, improving retrieval efficiency and usefulness.

Abstract

Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.

2605.20682 2026-05-21 cs.CV

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools


Rongbin Tan, Fangfang Lin, Zhenlong Yuan, Min Qiu, Kejin Cui, Mengmeng Wang, Yi Wang, Zijian Song, Zhiyuan Wang, Jiyuan Wang, Yue Wang, Shuhan Song, Huawei Cao

Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; Santa Clara University; LongCat Team; Independent Researcher; New York University; Sun Yat-sen University; Nanyang Technological University; Stanford University; University of Chinese Academy of Sciences, Beijing, China

AI summary: This paper proposes the IndusAgent framework, which integrates global visual observations, high-resolution local patches, and expert normalcy priors to improve zero-shot open-vocabulary industrial anomaly detection, and validates the method's robustness and generalization.

Abstract

Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose IndusAgent, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct Indus-CoT, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating its robustness and generalization capacity.

2605.20680 2026-05-21 cs.CV

DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions


Jiaqi Chen, Qinfu Xu, Liyuan Pan

Affiliations: Beijing Institute of Technology

AI summary: This paper proposes EIS-HAR, a method that pairs an event camera with an inertial measurement unit and uses a non-linear warping module to reduce motion blur before extracting spatiotemporal features. It also introduces the DarkShake-DVS benchmark for human action recognition under low light and 6-DoF camera motion.

Comments: 8 pages, 7 figures

Abstract

Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise the reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: the absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and the lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark, which includes 18,041 real-world clips captured in low light and under intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

2605.20678 2026-05-21 cs.LG cs.AI

Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting


Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu

Affiliations: School of Software Technology, Zhejiang University, Ningbo, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China

AI summary: This paper proposes the Dynamic TMoE framework, which dynamically instantiates heterogeneous experts and prunes redundant ones to optimize capacity, and uses a temporal memory router for stable, context-aware expert selection, achieving better performance on non-stationary time series forecasting.

Comments: 27 pages, 7 figures. Accepted to ICML 2026

Abstract

Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during the learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone-07/Dynamic-TMoE.
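
Drift detection with Maximum Mean Discrepancy can be sketched as follows. This is a generic illustration rather than the Dynamic TMoE code: the RBF kernel, the bandwidth `gamma`, the biased estimator, the 1-D samples, and the `DRIFT_THRESHOLD` value are all assumptions, since the abstract only says that MMD is used to detect distribution shifts.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel on scalars; gamma is an assumed bandwidth parameter."""
    return math.exp(-gamma * (x - y) ** 2)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two 1-D samples."""
    k_xx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


reference = [0.0, 0.1, -0.1, 0.05]        # window from the current regime
same_regime = [0.02, -0.05, 0.08, 0.0]    # new window, same distribution
shifted_regime = [3.0, 3.1, 2.9, 3.05]    # new window after a level shift

DRIFT_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper

drift_same = mmd_squared(reference, same_regime) > DRIFT_THRESHOLD
drift_shifted = mmd_squared(reference, shifted_regime) > DRIFT_THRESHOLD
```

Under these assumptions only the shifted window crosses the threshold; in a framework like the one described, such a signal would be the trigger for instantiating a new expert.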

2605.20676 2026-05-21 cs.CV

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence


Mozhgan Nasr Azadani, Yimu Wang, Yongpeng Zhu, Lihong Chen, Milan Ganai, Sean Sedwards, Marco Pavone, Krzysztof Czarnecki

Affiliations: University of Waterloo; Stanford University; NVIDIA

AI summary: This paper proposes the VISTAQA benchmark for jointly evaluating free-form answer correctness and pixel-level evidence grounding in visual question answering. It introduces the GROVE metric, which ties answer correctness to visual-evidence alignment. Experiments show that existing systems score poorly under it, revealing a substantial gap between answer accuracy and evidence alignment.

Abstract

Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.
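
The effect of a per-sample geometric mean can be seen in a small sketch. This is not the official GROVE implementation: it assumes both the answer score and the grounding score lie in [0, 1], and it assumes the benchmark score is the plain average of the per-sample values, neither of which the abstract states.

```python
import math


def joint_score(answer_score, grounding_score):
    """Geometric mean of two scores in [0, 1]; zero if either one is zero."""
    return math.sqrt(answer_score * grounding_score)


def benchmark_score(samples):
    """Average the per-sample joint scores (assumed aggregation)."""
    return sum(joint_score(a, g) for a, g in samples) / len(samples)


# (answer_score, grounding_score) pairs; values are made up for illustration.
samples = [
    (1.0, 0.81),  # correct answer, good mask
    (1.0, 0.0),   # correct answer, no usable evidence mask
    (0.0, 0.9),   # good mask, wrong answer
]

score = benchmark_score(samples)
arithmetic = sum((a + g) / 2.0 for a, g in samples) / len(samples)
```

Here only the first sample earns credit, so `score` is far below the arithmetic average of the same numbers; a right answer with no supporting mask scores zero, which matches the abstract's statement that neither dimension can compensate for the other.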

2605.20674 2026-05-21 cs.LG

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach


Herman Bergström, Aditya Mehrotra, Rahul G. Krishnan

Affiliations: Chalmers University of Technology and University of Gothenburg; Vector Institute; University of Toronto

AI summary: This paper proposes CoMET, a fine-tuning-free multimodal classification method that passes each modality through a frozen pre-trained backbone, compresses the embeddings with PCA, and feeds them to a tabular foundation model for prediction. It shows that PCA alone is a robust adaptor across modalities and proposes PALPooling to improve representation quality, reaching state-of-the-art training-free multimodal results.

Comments: 30 pages, 17 figures

Abstract

We introduce CoMET (Composing Modality Encoders with Tabular foundation models), a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate as input into a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices to act as an adaptor yielding strong, robust performance across modalities. When the CLS tokens of the foundation model align poorly with downstream tasks, we propose PALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with large fine-grained class spaces, our approach enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that the composition of foundation models is a simple, yet powerful, out-of-the-box solution for multimodal learning, challenging the necessity of complex, end-to-end training pipelines for new problems.

2605.20669 2026-05-21 cs.CV

GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection


Jiahao Kong

Affiliations: SDU-ANU Joint Science College; Shandong University

AI summary: This paper proposes the GSA-YOLO framework, which uses structured sparsity and adaptive knowledge distillation to improve detection robustness and inference efficiency for real-time X-ray security inspection, balancing accuracy and speed.

Comments: 41 pages, 8 figures, submitted to Scientific Reports

Abstract

X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO's comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.
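
The structured-sparsity idea behind Group Lasso can be sketched in a few lines. This is a generic illustration, not the GSA-YOLO training code: treating each output channel as one group, the `strength` coefficient, and the pruning tolerance are assumptions, and the square-root-of-group-size weighting used in some formulations is omitted for brevity.

```python
import math


def l2_norm(group):
    """L2 norm of one weight group."""
    return math.sqrt(sum(w * w for w in group))


def group_lasso_penalty(channel_weights, strength=1e-3):
    """Sum of per-group L2 norms, scaled by an assumed strength coefficient.

    Penalising the norm of a whole group pushes entire channels toward
    zero, rather than scattering individual zero weights.
    """
    return strength * sum(l2_norm(group) for group in channel_weights)


def prunable_channels(channel_weights, tol=1e-6):
    """Indices of channels whose whole weight group has collapsed to ~0."""
    return [i for i, group in enumerate(channel_weights) if l2_norm(group) <= tol]


# Three hypothetical channels with two weights each.
channels = [[3.0, 4.0], [0.0, 0.0], [0.6, 0.8]]

penalty = group_lasso_penalty(channels, strength=1.0)
to_prune = prunable_channels(channels)
```

Channels returned by `prunable_channels` can be removed as whole units, which is what makes the resulting sparsity usable for real speed-ups rather than only a smaller parameter count.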

2605.20668 2026-05-21 cs.CL cs.AI cs.LG

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists


Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

Affiliations: Nature

AI summary: Through a large-scale expert annotation study, this paper examines the capabilities and limits of AI reviewers in scientific peer review. AI reviews score well on correctness, significance, and sufficiency of evidence but show weaknesses such as limited domain knowledge and poor context management, indicating that AI reviewers complement rather than replace human reviewers.

Comments: Work in progress

Abstract

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

2605.20667 2026-05-21 cs.CV

LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection


Liming Hou, Yueping Peng, Hexiang Hao, Ji Wang, Xuekai Zhang, Wei Tang, Zecong Ye, Xin Ying, Yubo He

Affiliations: Engineering University of PAP; Unit Command Department, Officers College of PAP

AI summary: This work proposes LER-YOLO, a reliability-aware sparse mixture-of-experts method for UAV detection in RGB-infrared remote-sensing pairs. An uncertainty-aware target alignment module and a reliability-guided sparse MoE fusion module make cross-modal interaction more reliable.

Comments: 17 pages, 6 figures, 8 tables

Abstract

Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7 ± 0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.
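
Sparse top-k expert selection, the general mechanism the fusion module builds on, can be sketched as below. This shows only generic top-k gating: the gate logits are made-up numbers, and how LER-YOLO derives them from its spatial reliability map is not described in the abstract, so nothing here should be read as the paper's routing rule.

```python
import math

# Expert names follow the three expert types listed in the abstract.
EXPERTS = ("rgb_dominant", "infrared_dominant", "interactive_fusion")


def route_top_k(gate_logits, k=2):
    """Keep the k experts with the largest logits and softmax their weights.

    Experts outside the top k receive no weight at all, which is what
    makes the mixture sparse.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in ranked)
    # Subtract the peak before exponentiating for numerical stability.
    exp_scores = {i: math.exp(gate_logits[i] - peak) for i in ranked}
    total = sum(exp_scores.values())
    return {EXPERTS[i]: exp_scores[i] / total for i in ranked}


# Hypothetical gate output for one spatial location.
weights = route_top_k([0.2, 2.0, 1.0], k=2)
```

With these example logits the RGB-dominant expert is dropped entirely and the remaining two weights sum to one.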

2605.20666 2026-05-21 cs.RO

A Semantic and Occlusion-Aware GM-PHD Filter


Jovan Menezes, Mark Campbell

Affiliations: Sibley School of Mechanical and Aerospace Engineering, Cornell University

AI summary: This paper proposes a Semantic-Occlusion Aware (S-OA) birth model that uses semantic information derived from deep learning and explicit occlusion reasoning to build an occlusion-aware GM-PHD filter, reducing track initialization delay in occlusion-heavy driving scenes.

Comments: Accepted at ICRA 2026

Abstract

This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.
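
The OSPA metric used in the evaluation can be sketched for small sets as follows. This is a textbook-style illustration and not the authors' evaluation code: the 1-D positions, the cutoff of 10.0, the order of 1, and the brute-force search over assignments (practical only for a handful of objects) are all assumptions.

```python
import itertools


def ospa(tracks, truths, cutoff=10.0, order=1):
    """OSPA distance between two finite sets of 1-D positions.

    Combines a localization term (best assignment of the smaller set to
    the larger one, with distances capped at `cutoff`) and a cardinality
    term (one `cutoff` penalty per unmatched object).
    """
    if not tracks and not truths:
        return 0.0
    small, large = sorted((tracks, truths), key=len)
    n = len(large)
    # Brute-force search over all ways to match `small` into `large`.
    best = min(
        sum(min(abs(s - large[j]), cutoff) ** order for s, j in zip(small, perm))
        for perm in itertools.permutations(range(n), len(small))
    )
    cardinality_penalty = (cutoff ** order) * (n - len(small))
    return ((best + cardinality_penalty) / n) ** (1.0 / order)


# Two tracks close to two of three true objects; one object is missed.
distance = ospa([0.0, 5.0], [0.5, 5.5, 20.0])
```

In this example the two matched pairs contribute 0.5 each and the missed object contributes the full cutoff, so a late track initiation shows up directly as a cardinality penalty.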

2605.20659 2026-05-21 cs.CV cs.LG

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers


Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan

Affiliations: Peking University; University of Electronic Science and Technology of China; Beijing University of Posts and Telecommunications

AI summary: This work proposes RoPeSLR, a 3D RoPE-guided sparse low-rank attention framework that targets the high attention complexity of long-sequence generation in diffusion transformers. By combining a high-frequency semantic spike set with an extremely low-rank background continuum, it attains sub-quadratic sparsity and sub-linear rank growth, and performs well on ultra-long video inference.

Abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extremely low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).
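
The relative-position structure that the abstract says must be preserved can be demonstrated on a single 2-D feature pair. This is a toy illustration of rotary embeddings along one axis, not the RoPeSLR method: the frequency `theta`, the vectors, and the positions are arbitrary assumptions, and a 3D variant would, as commonly implemented, apply such rotations separately along the time, height, and width axes.

```python
import math


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D feature pair by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def rope_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query with a rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]


q, k = (1.0, 2.0), (0.5, -1.0)

near = rope_score(q, k, 3, 1)       # offset of 2 near the sequence start
far = rope_score(q, k, 103, 101)    # same offset of 2, 100 tokens later
other = rope_score(q, k, 3, 0)      # a different offset (3)
```

`near` and `far` agree up to floating-point error because the score depends only on the offset between the two positions, while `other` differs; this is the distance awareness that, according to the abstract, standard linear attention fails to keep.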

2605.20651 2026-05-21 cs.CV

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation


Tuopusen Huang, Ding Ma, Xiangqian Wu

Affiliations: Faculty of Computing

AI summary: This paper proposes LSENet, which adds three modules to address the vessel discontinuities and detail loss caused by low local contrast in OCTA vessel segmentation. Experiments show state-of-the-art performance on several public datasets with fewer parameters.

Abstract

Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture. However, most of these methods focus only on holistic representation and struggle with the low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core modules. To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.

2605.20648 2026-05-21 cs.RO cs.AI

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition


Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

发表机构 * Brown University(布朗大学) Robotics & AI Institute(机器人与人工智能研究所)

AI总结 本文提出了一种联合学习谓词和动作的技能方法,通过闭合回路的视觉-运动策略,使机器人能够在不重新训练的情况下实现零样本技能组合。

详情
AI中文摘要

示范学习(LfD)使机器人能够从专家示例中学习复杂行为,但现有方法往往无法在不重新训练的情况下泛化到已知技能的新组合。现代生成式策略仅对动作轨迹的分布建模,因此无法推理稳健组合所需的符号结果。我们提出,技能应联合建模动作轨迹及其所引发的符号结果。为弥补这一差距,我们引入谓词动作技能(PACTS),这是一类闭环视觉运动策略,将技能建模为动作轨迹与谓词信念轨迹的联合生成过程,在单一模型中产生连贯的动作-结果推演(rollout)。联合生成动作和谓词使PACTS能够学习到同时改进动作生成和谓词分类的内部表示。此外,我们将PACTS的在线谓词预测用作符号接口,对执行过程进行排序和监控,从而通过规划实现已学技能的零样本组合。项目网站:https://planpacts.github.io/

英文摘要

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

2605.20645 2026-05-21 cs.CV

Seeing Through Fog: Towards Fog-Invariant Action Recognition

穿透雾气:迈向雾不变的动作识别

Enqi Liu, Liyuan Pan, Zhi Gao, Lingzhi Li, Qing Li

发表机构 * Beijing Institute of Technology, Beijing, China(北京理工大学,北京,中国) Beijing Institute for General Artificial Intelligence, Beijing, China(北京通用人工智能研究院,北京,中国) Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China(北京理工大学长三角研究院,嘉兴,中国)

AI总结 本文提出FogAct基准数据集和FogNet模型,旨在解决雾天环境下动作识别中的挑战,通过改进的两流CLIP模型提取雾不变的语义信息,提升在雾天条件下的动作识别性能。

详情
AI中文摘要

雾天条件在现实应用中十分常见;然而,现有动作识别方法通常假设天气良好且视频输入质量较高。在雾天,不可预测的能见度下降和对比度降低会妨碍语义线索的提取,给当前的动作识别方法带来重大挑战。本文采用两种策略来缓解雾天条件下动作识别面临的问题。首先,我们提出FogAct,这是首个雾天动作识别基准数据集,由立体相机系统拍摄的成对清晰视频和雾天视频组成。该数据集涵盖10个场景和55个动作类别,包含近10,000个视频片段。其次,我们提出FogNet,一种双流CLIP模型,用于发现隐藏在退化视频背后的雾不变语义信息。FogNet在清晰视频的指导下学习雾天视频的鲁棒表示,有效捕捉清晰视频与雾天视频之间共享的结构和运动线索。在FogAct和另外三个常用数据集上的大量实验表明,与最先进(SOTA)方法相比,我们的方法取得了有竞争力的性能。FogAct和FogNet可在我们的项目页面获取。

英文摘要

Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.

2605.20644 2026-05-21 cs.LG cs.AI cs.RO

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

面向制造的设计:一种融合可制造性知识的强化学习框架,用于航空发动机自由形管道路由

Caicheng Wang, Zili Wang, Shuyou Zhang, Yongzhe Xiang, Zheyi Li, Liangyou Li, Jianrong Tan

发表机构 * State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University(浙江大学流体动力与机电系统国家重点实验室) Engineering Research Center for Design Engineering and Digital Twin of Zhejiang Province, Zhejiang University(浙江省设计工程与数字孪生工程研究中心) Zhejiang Changxing Heliang Intelligent Equipment Co., Ltd.(浙江长兴鹤浪智能装备有限公司)

AI总结 本文提出了一种集成制造知识的强化学习框架,用于航空发动机中自由形管道路由优化,通过将制造知识作为约束条件,提高了管道路径的可制造性和几何平滑度。

详情
AI中文摘要

面向制造的设计在先进航空发动机研发中起着关键作用,其中复杂部件需要仔细考虑可制造性。然而,当前的管道路由实践在很大程度上仍与下游制造脱节,需要耗费大量人力进行反复试错才能得到可制造的设计。为解决这一问题,本研究提出基于Frenet标架的管道路由优化(FPRO)框架,这是一种融合可制造性知识的强化学习方法,用于航空发动机自由形管道设计。FPRO将路由问题表述为Frenet标架下的边值问题。在该框架中,管道路径由曲率和挠率曲线表示,这些曲线通过三次Hermite插值生成。为了将设计与制造相结合,领域特定的制造知识被嵌入为对曲率和挠率允许范围的约束。路径优化采用近端策略优化算法,并结合随机探索和阶段引导的奖励机制。随后,统一的映射公式将优化后的路径转换为弯曲模具的运动轨迹,可直接在六轴自由弯曲机上制造。实验结果表明,与基于笛卡尔坐标的方法相比,FPRO能够稳定生成无碰撞、可制造的路径,且几何曲线更平滑。与最先进的强化学习基线相比,它还实现了更快的收敛,并在终端对齐、路径长度、避障和可制造性方面表现更优。实物验证证实了所制造管道与其数字设计之间几何上的高度一致,验证了FPRO的实际可行性。

英文摘要

Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from down-stream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.

2605.20643 2026-05-21 cs.LG cs.AI cs.CL

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

AVSD:通过平衡共识和教师特定的特权信号实现自适应视图自蒸馏

Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Capital One(Capital One公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出AVSD,一种通过平衡共识和教师特定的特权信号来实现自适应视图自蒸馏的方法,以解决自蒸馏中教师和学生信息不对称和特权信息选择的问题。

Comments Code: https://github.com/duykhuongnguyen/AVSD

详情
AI中文摘要

自蒸馏让同一模型同时充当学生和教师,使语言模型能够从自身轨迹中进行同策略(on-policy)学习,其中教师以学生无法获得的特权信息为条件。此类信息可以有不同的类型或视图,如解题过程、演示、反馈或最终答案。这种设置无需依赖单独的外部模型即可提供密集的token级反馈,但会带来根本性的不对称:教师可能依赖学生在推理时无法获得的视图特定信息。此外,最佳的特权信息类型通常取决于任务,因此难以选定单一教师视图。在本工作中,我们提出AVSD(自适应视图自蒸馏)来同时应对这两个挑战,这是一种使用多个特权信息视图的自蒸馏新方法,通过将稳定的跨视图共识与视图特定的残差信号分离来重构token级监督。AVSD首先识别各视图共享的共识信号,由其提供可靠的更新方向;然后仅当视图特定的残差信号与共识方向一致且相对共识信号保持适当比例时,才选择性地加入该残差信号以调整更新幅度。在数学竞赛基准(AIME24、AIME25和HMMT25)上的实验表明,AVSD始终优于单视图自蒸馏基线和GRPO,在Qwen3-8B和Qwen3-4B上相对最强基线的平均Avg@8增益分别为3.1%和2.2%。此外,在代码生成基准(Codeforces、LiveCodeBench v6)上使用Qwen3-8B时,AVSD平均比单视图自蒸馏基线高出2.4%。

英文摘要

Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

2605.20642 2026-05-21 cs.LG

Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions

相同目标,不同盆地:标注者分布中的硬标签与软标签

Mirerfan Gheibi, Gashin Ghazizadeh

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了在标注者分布中硬标签与软标签的区别,发现当每个示例的标注数量较少时,硬标签方法在性能上优于软标签训练,尤其是在稀疏经验目标远离完整标注者分布时效果更佳。

Comments 14 pages, 12 figures. Accepted to the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML @ ICML 2026)

详情
AI中文摘要

当标注者存在分歧时,这种分歧可能反映的是认知不确定性,而非简单的标签噪声。通常的做法是将投票合并为单一标签,或直接在经验软标签分布上训练;我们研究硬标签交付作为这两种做法的替代方案。我们重点关注两种主要的硬标签方法:多轮遍历(multipass),它在保持数据集大小不变的情况下循环使用观察到的投票;以及随机标签采样(SLS),它在每个epoch开始时为每个示例采样一个标签。在CIFAR-10H上,我们发现当每个示例只有少量标注时,硬标签交付优于软标签训练,且稀疏经验目标离完整标注者分布越远,改进越大。当完整标注者分布可用时,两种硬标签方法与软标签训练表现相当。我们使用确定性对照作为multipass的消融实验,并使用打乱的SLS作为破坏示例与分布对应关系的对照。我们还证明了SLS与软标签交叉熵优化的是相同的期望目标。硬标签交付还会收敛到更平坦的盆地,在SVHN和CIFAR-100上的OOD检测提供了支持性的描述性证据。总体而言,这些结果表明,当原始投票计数可用时,multipass是一个强有力的实用默认选择;SLS则是一个轻量级替代方案,在每个示例只有少量投票时仍具竞争力,并在完整标注者分布可用时与软标签训练相当。

英文摘要

When annotators disagree, that disagreement can reflect epistemic uncertainty rather than simple label noise. We study hard-label delivery as an alternative to the usual choices of collapsing votes to a single label or training directly on the empirical soft-label distribution. We focus on two primary hard-label methods: multipass, which cycles through observed votes while keeping the dataset size fixed, and stochastic label sampling (SLS), which samples one label per example at the start of each epoch. On CIFAR-10H, we find that when only a small number of annotations per example is available, hard-label delivery improves over soft-label training, with larger improvements where the sparse empirical target is farther from the full annotator distribution. When full annotator distributions are available, both hard-label methods match soft-label training. We use deterministic control as an ablation of multipass and shuffled SLS as a control that breaks the example-to-distribution match. We also show that SLS and soft-label cross-entropy optimize the same expected objective. Hard-label delivery also converges to flatter basins, with supporting descriptive evidence from OOD detection on SVHN and CIFAR-100. Overall, these results suggest that multipass is a strong practical default when raw vote counts are available, while SLS offers a lightweight alternative that remains competitive when only a few votes per example are available and matches soft-label training when full annotator distributions are available.
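The claim that SLS and soft-label cross-entropy optimize the same expected objective can be checked numerically on a single example. The sketch below is only an illustration: the vote shares and model probabilities are made-up values, and the helper names are hypothetical, not taken from the paper.

```python
import math

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def one_hot(label, num_classes):
    """Hard label as a one-hot distribution."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def expected_sls_loss(annotator_dist, probs):
    """Expected hard-label loss when one label is sampled from annotator_dist."""
    n = len(annotator_dist)
    return sum(q * cross_entropy(one_hot(k, n), probs)
               for k, q in enumerate(annotator_dist))

annotator_dist = [0.6, 0.3, 0.1]  # hypothetical vote shares for one example
probs = [0.5, 0.4, 0.1]           # hypothetical model predictions

soft_loss = cross_entropy(annotator_dist, probs)
sls_loss = expected_sls_loss(annotator_dist, probs)
print(soft_loss, sls_loss)
```

The two numbers agree because cross-entropy is linear in the target. Any difference the abstract reports between the methods therefore comes from how training proceeds on individual sampled labels, not from the objective in expectation.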

2605.20630 2026-05-21 cs.AI

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

评估代理计划-执行管道中的时间语义缓存和工作流优化

Alimurtaza Mustafa Merchant, Krish Veera, Sajal Kumar Goyla, Shambhawi Bhure, Dhaval Patel, Kaoutar El Maghraoui

发表机构 * Columbia University(哥伦比亚大学) IBM IBM Research(IBM研究院)

AI总结 本文研究了在代理计划-执行管道中时间语义缓存和工作流优化的问题,提出两种互补的优化层以提高效率,并展示了其在工业资产操作工作流中的应用效果。

Comments 13 pages, 8 figures, 3 appendices

详情
AI中文摘要

工业资产运维工作流对延迟敏感,因为单个用户查询可能需要协调传感器数据、工单、故障模式、预测工具和领域特定代理。我们在AssetOpsBench(AOB)上评估这一问题,这是一个工业代理基准,其计划-执行管道暴露了工具发现、LLM规划、MCP工具执行和最终总结中的重复开销。现有的LLM缓存技术(如KV缓存复用和基于嵌入的语义缓存)是为聊天机器人服务设计的,当输出的有效性取决于时间、资产或传感器参数时便会失效。我们为AOB计划-执行管道提出两个互补的优化层:时间语义缓存,以及一组MCP工作流优化,后者结合了磁盘持久化的工具发现缓存和依赖感知的并行步骤执行。MCP工作流优化带来1.67倍的加速,将端到端延迟中位数降低约40.0%;时间缓存基准在缓存命中时实现了中位数30.6倍的加速。除加速之外,我们的结果还揭示了纯语义缓存在参数丰富的工业查询上的一种具体失败模式,并对MCP支持的代理基准中缓存选择如何影响评估正确性给出了批判性分析。

英文摘要

Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0% while the temporal-cache benchmark achieved a median of 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.
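The abstract does not describe how the temporal semantic cache is built, so the following is only a guess at the general idea: a cached answer is reused when the query is similar and the asset and time parameters match exactly and the entry is still fresh. The class name, the token-overlap similarity (a stand-in for embedding similarity), the threshold and the TTL are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    tokens: frozenset   # bag of query words (stand-in for an embedding)
    params: tuple       # sorted (key, value) pairs such as asset and time window
    answer: str
    stored_at: float    # timestamp in seconds

class TemporalSemanticCache:
    """Hypothetical cache: similar wording alone is not enough for a hit."""

    def __init__(self, threshold=0.6, ttl_seconds=300.0):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []

    @staticmethod
    def _similarity(a, b):
        # Jaccard overlap of word sets; an assumption, not the paper's metric.
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    def put(self, query, params, answer, now):
        key = tuple(sorted(params.items()))
        self.entries.append(Entry(frozenset(query.lower().split()), key, answer, now))

    def get(self, query, params, now):
        tokens = frozenset(query.lower().split())
        key = tuple(sorted(params.items()))
        for e in self.entries:
            fresh = now - e.stored_at <= self.ttl_seconds
            if fresh and e.params == key and self._similarity(tokens, e.tokens) >= self.threshold:
                return e.answer
        return None

cache = TemporalSemanticCache()
p1 = {"asset": "pump-1", "window": "2026-05-20"}
p2 = {"asset": "pump-2", "window": "2026-05-20"}
cache.put("show vibration trend for pump", p1, "trend A", now=0.0)

hit = cache.get("show the vibration trend for pump", p1, now=10.0)
wrong_asset = cache.get("show the vibration trend for pump", p2, now=10.0)
stale = cache.get("show the vibration trend for pump", p1, now=1000.0)
print(hit, wrong_asset, stale)
```

A purely semantic cache would return "trend A" for the pump-2 query as well, since the wording is identical. That is the kind of failure on parameter-rich queries that the abstract points to.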

2605.20626 2026-05-21 cs.CL cs.AI cs.CV

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

基于检索的长上下文翻译用于文化图像描述:佛罗里达大学Gators参加2026年美洲自然语言处理共享任务的提交

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

发表机构 * University of Florida(佛罗里达大学)

AI总结 本文提出了一种基于检索的长上下文翻译方法,用于文化图像描述,通过两阶段流程生成西班牙语中间描述,再利用检索增强的多示例提示生成目标语言描述,显著提升了Bribri、Guaraní和Orizaba Nahuatl语言的描述生成性能,并在共享任务中获得冠军。

详情
AI中文摘要

我们介绍佛罗里达大学Gators团队提交给AmericasNLP 2026共享任务的系统,该任务面向原住民语言的文化图像描述。我们的两阶段流程先用Qwen2.5-VL生成西班牙语中间描述,再用Gemini 2.5 Flash通过检索增强的多样本(many-shot)提示生成目标语言描述。在开发集评估中,我们在Bribri、Guaraní和Orizaba Nahuatl描述任务上相对共享任务基线分别取得164.1%、131.7%和122.6%的提升,并在测试集评估中对Bribri和Orizaba Nahuatl保持了超过150%的提升。我们发现检索的效果高度依赖语言,仅对大规模的领域内语料有益;合成数据增强贡献了开发集上Guaraní性能增益中约28 chrF++。我们的提交是该共享任务的总冠军,并在目标语言描述的人工评估中位列五份入围提交中的第二名。

英文摘要

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

2605.20624 2026-05-21 cs.CV cs.AI cs.LG

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

用自回归扩散模型加速视频逆问题求解器

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

发表机构 * KAIST(韩国科学技术院) EverEx

AI总结 本文提出自回归视频逆问题求解器(AVIS),通过自回归扩散模型实现流式视频恢复,显著降低初始延迟并提高吞吐量,同时保持高质量的恢复效果,并进一步提出加速变体AVIS Flash,实现更高的吞吐量和更优的效率-性能权衡,为实时部署铺平道路。

Comments Project page is available here: https://avis-project.github.io/

详情
AI中文摘要

扩散模型为零样本视频逆问题提供了强大的先验知识,但其实时部署受到两个效率问题的阻碍:由整体视频恢复引起的高初始延迟,以及由于在像素空间中多次VAE传递以强制测量一致性导致的低吞吐量。为克服这些限制,我们提出了自回归视频逆问题求解器(AVIS)。AVIS框架利用自回归视频扩散模型以流式方式恢复视频,自然地消除了延迟瓶颈。具体而言,AVIS通过测量一致性的估计初始化反向扩散,减少了所需的采样步骤。与领先的非自回归求解器相比,AVIS将初始延迟从114秒减少到4秒,并将吞吐量从0.71提高到1.18 FPS,同时实现更优的恢复质量。我们进一步引入了一个高度加速的变体,称为AVIS Flash,该变体仅在第一个片段上强制测量一致性。AVIS Flash在单个RTX 4090 GPU上将吞吐量提高到5.91 FPS,同时保持竞争性的性能,并实现有利的效率-性能权衡,为实时部署铺平道路。

英文摘要

Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

2605.20620 2026-05-21 cs.LG cs.DB cs.GT

Dynamic Shapley Computation

动态Shapley值计算

Xuan Yang, Hsi-Wen Chen, Ming-Syan Chen, Jian Pei

发表机构 * Duke University(杜克大学) National Taiwan University(国立台湾大学)

AI总结 本文提出D-Shap框架,通过将Shapley值表示为玩家-任务矩阵,解决动态环境下训练数据贡献评估的高效更新问题,利用任务和联盟的局部性特性实现快速更新和自评估。

详情
AI中文摘要

基于Shapley值的数据估值为量化训练数据的贡献提供了一种有原则的方法,但其高昂的计算成本使其难以用于任务和训练玩家不断变化的动态场景。现有方法将Shapley计算视为一次性过程,并将贡献压缩为聚合分数,导致结果无法复用,任何变化都需要重新计算。我们引入一种新视角,将Shapley值表示为玩家×任务矩阵,并将动态估值表述为结构化的矩阵维护问题。我们利用两个事实:每个任务只依赖于一小部分训练玩家,且相似的任务产生相似的估值,由此得到效用局部性和联盟局部性。基于这些洞察,我们提出D-Shap,一种动态估值框架,只需修改矩阵的一小部分即可高效更新:新任务的估值通过结构感知插值推断,而新玩家引起的更新被限制在受影响的局部矩阵块内。为了不再需要预先指定评估任务,我们引入自估值,直接从训练数据构建初始矩阵,并辅以可扩展的子集复用和覆盖感知的锚点选择。在多种模型上的实验表明,D-Shap可在毫秒级完成任务更新,并将玩家更新的成本最多降低三个数量级,同时估值质量与完全重新计算相当。

英文摘要

Shapley-based data valuation provides a principled way to quantify the contribution of training data, but its high computational cost makes it impractical in dynamic settings where tasks and training players evolve. Existing methods treat Shapley computation as a one-shot process and collapse contributions into aggregated scores, preventing reuse and requiring recomputation under any change. We introduce a new perspective that represents Shapley values as a player-by-task matrix and formulates dynamic valuation as a structured matrix maintenance problem. We exploit the fact that each task depends on a small subset of training players and that similar tasks yield similar valuations, leading to utility locality and coalition locality. Based on these insights, we propose D-Shap, a dynamic valuation framework that enables efficient updates by modifying only a small portion of the matrix: new task valuations are inferred via structure-aware interpolation, while updates induced by new players are confined to affected local matrix blocks. To eliminate the need for pre-specified evaluation tasks, we introduce self-valuation, which constructs the initial matrix directly from training data, supported by scalable subset reuse and coverage-aware anchor selection. Experiments across diverse models show that D-Shap performs task updates in milliseconds and reduces the cost of player updates by up to three orders of magnitude, while achieving valuation quality competitive with full recomputation.
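To make the player-by-task matrix view concrete, the sketch below computes exact Shapley values by brute force for a three-player toy game and stores one column per task. It is not D-Shap: the players, the two utility functions and the dictionary layout are hypothetical, and the enumeration is exponential in the number of players.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley value of every player for one utility function."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

players = ["a", "b", "c"]
# Hypothetical tasks; each utility depends on only a few players.
tasks = {
    "task_1": lambda s: 1.0 if {"a", "b"} <= s else 0.0,
    "task_2": lambda s: float(len(s & {"c"})),
}

# Player-by-task matrix stored as nested dicts: matrix[player][task].
matrix = {p: {} for p in players}
for name, utility in tasks.items():
    for p, v in shapley_values(players, utility).items():
        matrix[p][name] = v

for p in players:
    print(p, matrix[p])
```

Player c has value zero for task_1 and players a and b have value zero for task_2, so each column touches only a few rows. This is a small instance of the locality the abstract describes, and adding a task adds a column without changing the existing ones.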

2605.20619 2026-05-21 cs.LG math.OC stat.ML

SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

SURF: 通过调整标量化权重以均匀遍历帕累托前沿

Liuyuan Jiang, Chentong Huang, Lisha Chen

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 本文提出SURF方法,通过调整标量化权重以实现帕累托前沿的均匀覆盖,解决了传统标量化方法在多目标优化中导致非均匀覆盖的问题。

详情
AI中文摘要

标量化因其简单性和可扩展性而在多目标优化中被广泛使用。在许多应用中,目标是生成能代表多样化用户偏好的解,理想情况下应均匀覆盖帕累托前沿(PF)。然而,对标量化权重进行均匀采样通常会导致PF的非均匀覆盖。我们通过对标量化路径的几何分析来解释这种不匹配:随着标量化权重的变化,相应的解通常以非均匀的速度遍历PF。这一速度诱导出一个弧长累积分布函数(CDF);对该CDF映射求逆,即可得到一条有原则的权重选择规则,使PF得到均匀覆盖。基于这一洞察,我们提出SURF(Sampling Uniformly along the PaReto Front,沿帕累托前沿均匀采样)。对于结构化问题(包括双目标老虎机),我们推导出该CDF映射的闭式表达式以及由此得到的PF感知权重采样规则。对于一般问题,SURF在CDF重建和权重采样之间交替进行。理论上,我们证明在可证明成立的条件下,SURF线性收敛到一个不可避免的有限采样下限。实验上,在老虎机、multi-objective-gymnasium和多目标LLM对齐上的实验表明,SURF能高效地实现比基线更均匀的PF覆盖。

英文摘要

Scalarization is widely used in multi-objective optimization owing to its simplicity and scalability. In many applications, the goal is to generate solutions that represent diverse user preferences, ideally with uniform coverage of the Pareto front (PF). However, uniformly sampling scalarization weights usually induces non-uniform coverage of the PF. We explain this mismatch through a geometric analysis of the scalarization path. As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. Building on this insight, we propose SURF (Sampling Uniformly along the PaReto Front). For structured problems, including bi-objective bandits, we derive closed-form expressions for this CDF map and the resulting PF-aware weight sampling rule. For general problems, SURF alternates between CDF reconstruction and weight sampling. Theoretically, we show that under provable conditions, SURF converges linearly to an unavoidable finite-sampling floor. Empirically, experiments on bandits, multi-objective-gymnasium, and multi-objective LLM alignment demonstrate that SURF efficiently achieves more uniform PF coverage than baselines.
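The inverse-CDF idea can be tried on a toy front. In the sketch below the map from weight to front point is invented for illustration (a quarter circle traversed at angle proportional to w cubed, so uniform weights bunch up near one end), and the CDF is estimated on a dense grid rather than in closed form. None of the names come from the paper.

```python
import math

def front_point(w):
    """Hypothetical map from a scalarization weight w in [0, 1] to a toy Pareto front."""
    theta = 0.5 * math.pi * w ** 3
    return (math.cos(theta), math.sin(theta))

def arc_length_cdf(grid):
    """Normalized cumulative arc length of the front traced along a weight grid."""
    pts = [front_point(w) for w in grid]
    cum = [0.0]
    for a, b in zip(pts, pts[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    return [c / cum[-1] for c in cum]

def invert_cdf(grid, cdf, u):
    """Weight whose CDF value is u, found by linear interpolation on the grid."""
    for i in range(1, len(grid)):
        if cdf[i] >= u:
            span = cdf[i] - cdf[i - 1]
            t = 0.0 if span == 0 else (u - cdf[i - 1]) / span
            return grid[i - 1] + t * (grid[i] - grid[i - 1])
    return grid[-1]

grid = [i / 2000 for i in range(2001)]
cdf = arc_length_cdf(grid)

# Equally spaced CDF targets give weights that are equally spaced along the front.
targets = [k / 4 for k in range(5)]
weights = [invert_cdf(grid, cdf, u) for u in targets]
points = [front_point(w) for w in weights]
gaps = [math.dist(a, b) for a, b in zip(points, points[1:])]

# Baseline for comparison: uniformly spaced weights.
uniform_points = [front_point(u) for u in targets]
uniform_gaps = [math.dist(a, b) for a, b in zip(uniform_points, uniform_points[1:])]
print(weights, gaps, uniform_gaps)
```

For this toy map the arc-length CDF is w cubed, so the selected weights approximate the cube roots of the targets and the gaps between consecutive front points come out nearly equal. The uniformly spaced weights leave most of the front covered by a single large gap.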

2605.20618 2026-05-21 cs.AI

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

COAgents: 多智能体框架用于学习和导航路由问题搜索空间

Oleksandr Yakovenko, Mahdi Mostajabdaveh, Cheikh Ahmed, Abdullah Ali Sivas, Xiaorui Li, Zirui Zhou, Mao Kun

发表机构 * Huawei Technologies Canada(华为技术加拿大公司) Huawei Technologies(华为技术)

AI总结 本文提出COAgents多智能体框架,通过将搜索过程建模为图来解决车辆路径问题的计算复杂性问题,通过训练不同智能体来指导强化和探索,从而在CVRP和VRPTW基准测试中取得优异成绩。

Comments Accepted at LION 2026, The Learning and Intelligent Optimization Conference

详情
AI中文摘要

尽管车辆路径问题(VRP)对许多现实系统至关重要,但由于其组合复杂性,在大规模情况下仍难以求解。传统启发式方法依赖手工设计的规则进行局部改进,并偶尔通过跳跃逃离局部极小值,但往往难以在多样化的实例上泛化。我们提出COAgents,一种协作式多智能体框架,将搜索过程建模为图:节点代表解,边对应局部细化或用于多样化的大幅扰动(即跳跃)。搜索过程中会动态构建部分搜索图(PSG),使COAgents能够训练节点选择智能体和移动选择智能体来引导强化搜索,并训练跳跃智能体在合适的时机触发对新区域的探索。与端到端学习方法不同,COAgents将与问题无关的搜索控制和紧凑的领域特定编码清晰分离,便于跨任务适配。在CVRP和VRPTW基准上的大量实验表明,COAgents在CVRP上与多个学习搜索(learn-to-search)基线保持竞争力,并在更具挑战性的VRPTW实例上刷新了基于学习的方法的最佳水平:相对于最强的神经求解器(POMO),与已知最优解的差距在N=100时缩小14%,在N=50时缩小44%;相对于ALNS,则分别缩小21%和40%。代码地址:https://github.com/mahdims/COAgents

英文摘要

Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional jumps to escape local minima, but often struggle to generalize across diverse instances. We introduce COAgents, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A Partial Search Graph (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14% at N=100 and 44% at N=50 relative to the strongest neural solver (POMO), and by 21% and 40% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.

2605.20616 2026-05-21 cs.CL

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

Auto-Dreamer:学习离线记忆巩固用于语言智能体

Chongrui Ye, Yuxiang Liu, Yu Wang, Haofei Yu, Yining Zhao, Ge Liu, Julian McAuley, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California San Diego(加州大学圣地亚哥分校)

AI总结 本文提出Auto-Dreamer,一种学习离线记忆巩固方法,用于语言智能体,通过分离快速会话记忆获取与慢速跨会话巩固过程,提升智能体在多个任务流中的记忆整合与知识复用能力。

Comments Preprint

详情
AI中文摘要

语言智能体越来越多地在相关任务流上运行,但现有记忆系统难以将积累的经验转化为可复用的知识。检索增强和结构化记忆方法能有效记录每个会话的观察,但往往将获取与巩固耦合在同一个在线过程中,使智能体缺乏跨会话的全局视图,无法发现重复出现的模式、抽象出共享流程或剪除冗余条目。受互补学习系统理论启发,我们提出Auto-Dreamer,一种用于语言智能体记忆的可学习离线巩固器。Auto-Dreamer将快速的逐会话记忆获取与缓慢的跨会话巩固解耦。给定类型化记忆库中选定的工作区域,巩固器将该区域视为只读证据,通过有界的工具调用检查条目以及与出处关联的源轨迹,并合成一个全新的紧凑替换集,该集合跨会话进行抽象并取代原始区域。我们通过GRPO训练Auto-Dreamer,以端到端的智能体性能作为奖励信号,学习如何巩固通过快速在线经验获得的记忆。仅在ScienceWorld轨迹上训练,Auto-Dreamer在ScienceWorld上比固定式、强化学习训练式和提示式记忆基线高出7分,同时其活跃记忆库比最强基线小12倍;在不重新训练的情况下,它在留出的ALFWorld和WebArena上继续领先,其中在ALFWorld上使用的记忆比最强基线少6倍。

英文摘要

Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12× smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining, using 6× less memory than the strongest baseline on ALFWorld.

2605.20613 2026-05-21 cs.CL

HRM-Text: Efficient Pretraining Beyond Scaling

HRM-Text: 超越规模的高效预训练

Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

发表机构 * Sapient Intelligence MIT(麻省理工学院)

AI总结 本文提出HRM-Text模型,通过引入分层递归模型和新的训练方法,在减少计算资源消耗的同时实现了与大规模模型相当的性能,展示了高效预训练的可能性。

详情
AI中文摘要

当前大型语言模型的预训练范式依赖海量算力和互联网规模的原始文本,给基础研究造成了显著障碍。相比之下,生物系统通过多时间尺度处理实现了高样本效率的学习,例如额顶叶环路的功能组织。受此启发,我们提出HRM-Text,用分层递归模型(HRM)取代标准Transformer,将计算解耦为慢速演化的策略层和快速演化的执行层。为了在语言建模中稳定这种深度递归,我们引入MagicNorm和预热式深度信用分配(warmup deep credit assignment)。此外,我们不采用标准的原始文本预训练,而是仅在指令-响应对上训练,使用任务完成目标和PrefixLM掩码。作为高效预训练的经验性存在证明,一个10亿参数的HRM-Text模型仅用400亿个不重复token和1,500美元预算从头训练,在MMLU上达到60.7%,在ARC-C上达到81.9%,在DROP上达到82.2%,在GSM8K上达到84.5%,在MATH上达到56.2%。尽管所用训练token约为标准基线的1/100至1/900,估计算力约为1/96至1/432,HRM-Text仍与2-7B参数的开放模型表现相当。这些结果表明,协同设计架构与训练目标可以大幅降低算力与性能之比,使从头预训练对更广泛的研究社区变得可行。

英文摘要

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

2605.20610 2026-05-21 cs.CV cs.AI

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

超越路由:刻画视觉混合专家模型中的专家调谐与表征

Gene Tangtartharakul, Katherine R. Storrs

发表机构 * School of Psychology University of Auckland(奥克兰大学心理学院)

AI总结 本文研究了视觉混合专家模型中专家调谐与表征的特性,使用对比目标训练稀疏门控卷积MoE模型,并借助视觉神经科学工具分析专家的专业化,发现有生命与无生命的区分主导了专家划分,并揭示专家对超越类别边界的连续视觉和语义维度具有更宽泛的调谐。

Comments 21 Pages, 6 Main Figures, 1 Table

详情
AI中文摘要

混合专家(MoE)模型通常通过分析哪些类别被路由到哪些专家来解释。然而,仅靠路由并不能揭示每个专家实际编码的内容。我们在自然图像上以对比目标训练稀疏门控卷积MoE模型,并借助视觉神经科学的工具刻画专家的专业化。从门控层面的分析扩展到专家层面,我们测量每个专家的类别可分性,并利用最大激活输入(most exciting inputs)测量每个专家的调谐。从类别层面的解释扩展到特征层面,我们借助从人类行为判断数据集(THINGS)得到的语义维度来解释调谐。最后,我们使用调谐和表征相似性分析,评估专长分配在不同独立初始化之间的稳定性。我们发现,有生命与无生命的区分主导了专家划分,这一点从门控一直到专家读出都很明显,并且在独立训练的模型之间保持稳定。尽管路由统计表明专家具有相对稀疏的类别化偏好,但专家层面的分析揭示出对连续视觉和语义维度更宽泛的调谐,这些维度超越了类别边界。尽管各专家的特征调谐各不相同,它们的类别可分性却彼此相近,这说明了超越类别层面分析在解释上的好处。总之,这些结果表明,视觉MoE中的专家专业化远不止类别路由,通过探查细粒度的专家层面调谐和表征结构可以得到更好的理解。

英文摘要

Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

2605.20609 2026-05-21 cs.LG

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh

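Affiliations Department of Electrical and Computer Engineering and ASRI, Seoul National University; Independent researcher; Robotics Institute, Carnegie Mellon University

AI summary This paper proposes a compositional transduction method based on latent analogies to address goal generalization to novel contexts in offline goal-conditioned reinforcement learning. A new analogy representation improves goal reaching across varying contexts.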

Comments ICML 2026

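Details
AI Chinese abstract (translated)

Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior methods achieve this through trajectory stitching over temporally contiguous segments, which limits the ability to compose behaviors across different contexts. To overcome this limitation, we formally define analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts, and propose a novel analogy representation for this purpose. According to our theory, this analogy representation captures what changes under optimal task execution, is invariant to contextual variations, and is sufficient for optimal goal reaching. We further argue that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new offline GCRL method that extends analogy transduction from seen pairs to unseen combinations. Experiments on OGBench manipulation environments demonstrate the effectiveness of our method, which substantially outperforms prior methods that do not perform analogy transduction. Project page: https://rllab-snu.github.io/projects/CTA/

English abstract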

Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab-snu.github.io/projects/CTA/

2605.20608 2026-05-21 cs.AI cs.NI

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

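Affiliations AsiaInfo Technologies Limited; Institute for AI Industry Research (AIR); Tsinghua University; Verimag

AI summary This paper proposes a hierarchical multi-agent reference architecture aimed at Level 4/5 autonomous networks. By introducing agent self-awareness, it unifies strategic planning with operational resilience, and it is validated in a 5G Core environment.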

Comments This manuscript has been accepted by IEEE Networking Letters

Journal ref B. Wu, S. Wang, Y. Liu, Y. -Q. Zhang, J. Sifakis and Y. Ouyang, "From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)," in IEEE Networking Letters, 2026

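Details
AI Chinese abstract (translated)

Realizing Level 4/5 Autonomous Networks (AN) requires a shift from static automation to agent-native intelligence. Current operations rely on rigid scripts and lack the cognitive ability to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture comprising a Dual-Driven Orchestrator that coordinates specialized Executive Agents, with a shared Public Memory providing unified domain knowledge. The key innovation is the integration of agent self-awareness, which lets the system reconcile deliberative strategic governance with reflexive fault recovery. We instantiate and validate the architecture in a 5G Core environment. Case studies show that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its effectiveness in unifying strategic planning with operational resilience.

English abstract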

Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

2605.20607 2026-05-21 cs.LG cs.CV cs.RO

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

Romeo Valentin, Olivia Beyer Bruvik, Marc R. Schlichting, Mykel J. Kochenderfer

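Affiliations Stanford Intelligent Systems Laboratory, Stanford University, Stanford, CA, USA

AI summary This paper proposes a learning-assurance approach for a vision-based landing system. It shows that the model separates content from style in its situation representation, which yields interpretable assurance evidence, and it introduces a new runtime assurance method that monitors the model's situation representation.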

Comments 10 pages, 4 figures

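Details
AI Chinese abstract (translated)

EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means of providing such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that an assurable model must, at a minimum, be shown to separate content from style in its situation representation. Showing that the model's predictions rely mainly on the content components of the representation then yields a concrete assurance path. To demonstrate this path on a concrete model, we train a vision transformer for runway keypoint regression on the LARDv2 dataset. This model, the subject of our assurance demonstration, produces per-patch embeddings, which we decompose into interpretable atoms via K-SVD sparse dictionary learning. Qualitative visualization confirms that content atoms track task-relevant runway structure and style atoms track domain-specific appearance, and that the regression head places almost all of its linear weight on content atoms. Building on the content/style separation, we further define out-of-model-scope (OOMS) detection, a novel runtime assurance method that directly monitors the model's situation representation. OOMS monitoring complements operational design domain monitoring and output-space out-of-distribution monitoring, and meets explicit requirements of the recent EASA guidance. By directly analysing the model's situation representation at test time and at runtime, this work provides the first concrete representation-level evidence required by EASA's learning-assurance guidance and points to mechanistic interpretability as a practical building block for future aviation safety cases.

English abstract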

EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.
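The abstract states that the regression head places almost all of its linear weight on contentful atoms. One simple way such a statement could be quantified is sketched below, assuming a linear head acting directly on sparse atom activations and a per-atom content/style label. The function name, the toy weight matrix, and the labels are hypothetical, and the paper's actual measure may differ.

```python
import numpy as np


def content_weight_share(head_weights: np.ndarray, is_content: np.ndarray) -> float:
    """Fraction of the head's total absolute weight that falls on content atoms.

    head_weights: (n_outputs, n_atoms) linear-head weights over atom activations.
    is_content:   (n_atoms,) boolean mask, True for atoms labelled as content.
    """
    per_atom = np.abs(head_weights).sum(axis=0)  # total |weight| per atom
    total = per_atom.sum()
    if total == 0.0:
        raise ValueError("head has no nonzero weights")
    return float(per_atom[is_content].sum() / total)


# Toy example: 2 regression outputs, 4 atoms (first two labelled content).
weights = np.array([
    [2.0, -1.5, 0.05, 0.00],
    [1.0, 0.5, 0.00, -0.05],
])
labels = np.array([True, True, False, False])

share = content_weight_share(weights, labels)
print(f"share of linear weight on content atoms: {share:.3f}")
```

A share close to 1 would be consistent with predictions that depend mainly on content atoms. Note that a weight-based share ignores how strongly each atom actually activates, so it is a coarse summary rather than a full attribution.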