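arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.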

2605.15562 2026-05-18 cs.CL

GiLT: Augmenting Transformer Language Models with Dependency Graphs

Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu

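Affiliations: School of Information Science and Technology, ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging

AI summary: GiLT augments Transformer language models with dependency graphs, improving syntactic generalization while keeping perplexity competitive; it can also be fine-tuned from a pretrained language model to improve downstream task performance.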

Abstract

Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie-pie-oops/GiLT-LM.
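The idea of modulating attention weights with dependency-graph features can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical additive bias (`graph_bias`) on the attention logits of token pairs linked by an arc in the graph built so far, and the function names, bias value, and toy arc are all illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def graph_modulated_attention(scores, arcs, graph_bias=1.0):
    """Causal attention whose logits are biased by graph arcs.

    scores[i][j]: raw attention logit from query i to key j.
    arcs: set of (head, dependent) index pairs known so far.
    Assumption: linked pairs simply receive an additive bias.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(i + 1):  # causal: attend only to positions <= i
            linked = (i, j) in arcs or (j, i) in arcs
            row.append(scores[i][j] + (graph_bias if linked else 0.0))
        # Pad masked (future) positions with zero weight.
        weights.append(softmax(row) + [0.0] * (n - i - 1))
    return weights

# Toy example: 3 tokens, uniform logits, one arc between tokens 0 and 2.
scores = [[0.0] * 3 for _ in range(3)]
attn = graph_modulated_attention(scores, arcs={(0, 2)}, graph_bias=1.0)
```

In this toy case, token 2 puts more weight on token 0 than on token 1 because of the arc, and no extra structural tokens enter the sequence.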

2605.15561 2026-05-18 cs.CV

RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

Jiayan Yang, Zhuoyu Wu, Wenqi Fang

Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; CyPhi AI Research Lab, School of IT, Monash University, Malaysia Campus

AI summary: This paper proposes RoiMAM, which combines a training-free ROI generation module with semantic selective suppression to focus on lesion-relevant regions, improving the efficiency and accuracy of medical visual question answering.

Comments: under revision

2605.15559 2026-05-18 cs.RO

NavRL++: A System-Level Framework for Improving Sim-to-Real Transfer in Reinforcement Learning-Based Robot Navigation

Zhefan Xu, Hanyu Jin, Kenji Shimada

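Affiliations: Department of Mechanical Engineering, Carnegie Mellon University

AI summary: This paper proposes the NavRL++ framework, which systematically studies the key factors behind sim-to-real transfer and introduces perturbation-aware fine-tuning and a Transformer-based temporal reasoning policy to improve robot navigation performance.

Comments: 18 pages, 18 figures, 6 tables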

Abstract

Recent years have witnessed significant progress in autonomous navigation using reinforcement learning. However, existing approaches largely emphasize reinforcement learning framework design, such as input representations, action spaces, and reward functions, while providing limited analysis of sim-to-real transfer and insufficient insight into how training strategies affect real-world deployment performance. To bridge this gap, we not only introduce an effective RL framework but also present a complete training and deployment pipeline, along with a systematic empirical study that disentangles the key factors affecting sim-to-real transfer in reinforcement learning-based navigation, including sensor noise, perception failures, system latency, and control response. Building on insights from this analysis, we introduce perturbation-aware fine-tuning, a post-training adaptation strategy that improves transfer robustness by explicitly accounting for empirically identified domain discrepancies. To further mitigate perception degradation and enhance control smoothness in real-world deployment, we propose a Transformer-based temporal reasoning policy that leverages short-horizon observation for navigation control. We quantitatively evaluate how individual sim-to-real perturbations and training design choices impact navigation performance across environments. Experimental results demonstrate that the proposed training strategy and policy architecture outperform learning-based baselines in both static and dynamic environments, while achieving performance comparable to optimization-based planners in static settings. We validate our approach through real-world deployment on multiple robotic platforms, including aerial and legged robots, across navigation-centric tasks such as exploration and inspection, demonstrating zero-shot sim-to-real transfer.
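Three of the factors the abstract names (sensor noise, perception failures, and system latency) can be mimicked in simulation with a small observation wrapper. The sketch below is a hypothetical illustration, not the NavRL++ pipeline: the class name, the range-sensor framing, and every parameter value are assumptions.

```python
import random
from collections import deque

class PerturbedSensor:
    """Adds noise, dropouts, and latency to clean range readings."""

    def __init__(self, noise_std=0.05, dropout_prob=0.1,
                 delay_steps=2, max_range=10.0, seed=0):
        self.noise_std = noise_std
        self.dropout_prob = dropout_prob
        self.max_range = max_range
        self.rng = random.Random(seed)
        # Keep delay_steps + 1 frames; the oldest is what the policy sees.
        self.buffer = deque(maxlen=delay_steps + 1)

    def observe(self, clean_ranges):
        frame = []
        for r in clean_ranges:
            if self.rng.random() < self.dropout_prob:
                # Perception failure: report "no return" at max range.
                frame.append(self.max_range)
            else:
                noisy = r + self.rng.gauss(0.0, self.noise_std)
                frame.append(min(self.max_range, max(0.0, noisy)))
        self.buffer.append(frame)
        return self.buffer[0]  # oldest buffered frame models latency

# Usage: with noise and dropout disabled, only the delay remains visible.
sensor = PerturbedSensor(noise_std=0.0, dropout_prob=0.0, delay_steps=2)
outputs = [sensor.observe([float(t)]) for t in range(1, 5)]
```

Once the buffer has filled, each returned frame lags the true reading by `delay_steps` steps, which is one simple way to expose a policy to latency during fine-tuning.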

2605.15557 2026-05-18 cs.CL cs.LG

When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation

De Shuai Zhang

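Report: Technical Report v1, May 2026

AI summary: This paper studies a draft-conditioned latent refinement model for non-autoregressive text generation and finds that latent geometry alone does not guarantee generation quality; decoder readability and structure preservation must also be taken into account.

Comments: 17 pages, 1 figure, 6 tables. Technical Report v1. Stage 1 complete; Stage 2 ongoing. Code: https://github.com/saslifat-gif/structured-latent-text-refinement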

Abstract

Continuous diffusion and flow models are attractive for non-autoregressive text generation because they can update all positions in parallel. A major difficulty is the interface between continuous latent states and discrete tokens. This report studies a draft-conditioned latent refinement model built from a frozen BERT encoder, a parallel decoder, a denoising DraftPrior, a local FlowNet, and a learned diagonal MetricNet. Early Gaussian-start experiments showed that good latent-space metrics, such as scale matching or cosine similarity, do not guarantee good decoding. Generated latents can be close to real encoder latents but still produce high-entropy, biased, or repetitive token distributions. We therefore frame the task as controlled local refinement rather than full generation from noise. On ROCStories, using the first two sentences as prompt and the last three as target, full 768-dimensional BERT latents recover tokens much better than compressed 256-dimensional latents. With 768-dimensional latents, DraftPrior target-token probability is 0.938 for clean drafts, 0.613 for 3% token dropout, 0.483 for 5% dropout, and 0.272 for 10% dropout. Local flow refinement and fused decoder-aware readout give modest additional gains, while metric learning and OT-style alignment improve geometry but do not close the decoder gap. The main result is a diagnostic one: latent geometry alone is not enough. Continuous latent text generation should be evaluated by decoder recoverability, the quality of the start distribution, and whether refinement preserves decoder-readable structure.

2605.15551 2026-05-18 cs.LG

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

Pedram Bakhtiarifard, Sophia N. Wilson, Mahmoud Afifi, Jonathan Wenshøj, Raghavendra Selvan

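Affiliations: Department of Computer Science, University of Copenhagen; The American University in Cairo

AI summary: This paper proposes QuBD, a method for estimating the algorithmic complexity of deep neural network weights; it shows how that complexity changes across learning stages during training and provides a basis for model compression decisions.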

Abstract

Training large-scale deep neural networks (DNNs) is resource-intensive, making model compression a practical necessity. The widely accepted "learning as compression" hypothesis posits that training induces structure in network weights, which enables compression. Measuring this structure through Kolmogorov-Chaitin-Solomonoff (KCS) complexity is appealing, but existing estimators based on the Coding Theorem Method (CTM) and the Block Decomposition Method (BDM) are limited to small binary objects and do not scale to modern DNNs. We introduce the Quantized Block Decomposition method (QuBD), which extends algorithmic complexity estimation to any k-ary object. QuBD first quantizes the network weights to a finite alphabet, then estimates the KCS complexity by aggregating per bit-plane CTM estimates. We show theoretically that QuBD yields a strictly tighter estimation gap with respect to true KCS complexity than binarization-based methods. Using QuBD, we study how the algorithmic complexity of neural network weights evolves during training, showing that it decreases as models learn, scales with data budget, increases during overfitting, follows the delayed generalization observed during grokking, and correlates with generalization performance. We further show that algorithmic information resides predominantly in the most significant bit-planes, which can serve as a practical diagnostic for determining appropriate post-training quantization levels. This work offers novel insights into learning mechanisms in DNNs by providing the first scalable, tractable estimates of KCS complexity for large, non-binary objects such as DNN weights.
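The quantize-then-aggregate-per-bit-plane pipeline can be sketched in a few lines. This is a rough illustration under loud assumptions: uniform min-max quantization is assumed, and zlib-compressed size stands in for the per-plane CTM estimate, which the paper uses and this sketch does not reproduce. All function names are hypothetical.

```python
import random
import zlib

def quantize(weights, bits):
    """Uniformly quantize floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    span = (hi - lo) or 1.0  # guard against constant weights
    return [round((w - lo) / span * levels) for w in weights]

def bit_plane(codes, plane):
    """Extract one bit-plane as 0/1 values (plane 0 = least significant)."""
    return [(c >> plane) & 1 for c in codes]

def plane_complexity(bits_list):
    """Stand-in estimator: zlib-compressed size in bytes (NOT CTM)."""
    packed = bytes(bits_list)  # one byte per bit keeps the sketch simple
    return len(zlib.compress(packed, 9))

def bitplane_complexity(weights, bits=4):
    """Sum the per-plane estimates; also return them individually."""
    codes = quantize(weights, bits)
    per_plane = [plane_complexity(bit_plane(codes, p)) for p in range(bits)]
    return sum(per_plane), per_plane

# Usage: a periodic weight vector versus pseudo-random weights.
rng = random.Random(0)
structured = [(i % 4) / 3 for i in range(4096)]
noisy = [rng.random() for _ in range(4096)]
structured_total, _ = bitplane_complexity(structured, bits=4)
noisy_total, _ = bitplane_complexity(noisy, bits=4)
```

Under this proxy the periodic weights score lower than the pseudo-random ones, matching the intuition that structure in weights lowers estimated complexity. A compressor is only a coarse upper-bound proxy for KCS complexity.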

2605.15549 2026-05-18 cs.LG cs.AI cs.CE

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models

Stefano Riva, Carolina Introini, Antonio Cammi, Dean Price, Alexey Yermakov, Yue Zhao, Philippe M. Wyder, Judah Goldfeder, Jan Williams, Amy Sara Rude, Matteo Tomasetto, Joe Germany, Joseph Bakarji, Georg Maierhofer, Miles Cranmer, J. Nathan Kutz

Affiliations: Autodesk Research; Department of Energy, Nuclear Engineering Division, Politecnico di Milano; Nuclear Science and Engineering, Massachusetts Institute of Technology; Department of Applied Mathematics, University of Washington; Department of Electrical and Computer Engineering, University of Washington; High Performance Machine Learning, SURF; Distyl AI; Department of Computer Science, Columbia University; Department of Mechanical Engineering, University of Washington; Department of Mechanical Engineering, Politecnico di Milano; Department of Mathematics, American University in Beirut; Department of Mechanical Engineering, American University in Beirut; Department of Applied Mathematics and Theoretical Physics, University of Cambridge

AI summary: This paper proposes CTF4Nuclear, a framework for standardized evaluation of machine learning methods in nuclear engineering; it scores methods on 12 metrics plus system monitoring from sparse measurements, aiming to improve the rigour and reproducibility of scientific ML for the nuclear industry.

Abstract

The demand for clean energy is ever increasing, with new nuclear technologies presenting a complementary solution to renewable energies. However, designing and operating these systems is exceptionally difficult, given the complexity of the physical phenomena that interact to form the system dynamics. While high-fidelity simulations help to understand the non-linear, multi-physics interactions within a reactor, they are computationally expensive and rarely suitable for real-time applications. Furthermore, model-based approaches are inherently sensitive to simplifying assumptions required to derive their governing equations and parameters, leading to inevitable discrepancies with real-world measurements. In contrast, Machine Learning (ML) methods have the potential to generate reliable surrogate models which may be able to quickly predict the system's behaviour. However, the number of data-driven methods that can potentially be used for this task is large and diverse. In a safety-critical setting such as nuclear engineering, a fair comparison of different ML methods, and a clear understanding of their advantages and limitations, is of paramount importance. To address this, we introduce a Common Task Framework (CTF) for ML in nuclear engineering, building upon previous efforts in dynamical systems and seismology. This CTF considers a curated set of datasets from different nuclear and nuclear-adjacent systems. The CTF evaluates the performance of a method on 12 established metrics, alongside a new paradigm focused on system monitoring from sparse measurements only. We illustrate the framework by benchmarking standard ML baselines against these datasets, revealing current method limitations. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigour and reproducibility in scientific ML for the nuclear industry.

2605.15548 2026-05-18 cs.RO

KaRMA: A Kinematic Metric for Fine Manipulation Ability in Robotic Hands

Martin Peticco, Pulkit Agrawal

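Affiliations: Improbable AI Lab, Massachusetts Institute of Technology

AI summary: KaRMA is a kinematics-based metric for a robotic hand's ability to continuously change an object's pose while maintaining contact, evaluated through the reachable translation and reorientation of a spherical test object.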

Abstract

Traditional robotic hand metrics focus on static properties such as workspace, manipulability, and grasp stability. However, these metrics do not directly measure dexterity under the standard definition in robotic manipulation: the ability to continuously change an object's pose within the hand while maintaining contact from an initial grasp. We introduce Kinematic Rolling Manipulation Ability (KaRMA), a kinematic-only metric for fine manipulation that quantifies reachable in-hand translation and reorientation of a spherical test object within a two-finger precision pinch through feasible rolling motions. KaRMA enforces joint limits, collision constraints, rolling contact, and antipodal force feasibility, then investigates reachable in-hand object poses via breadth-first search over translation and rotation primitives. KaRMA reports three scores: translational coverage (KaRMA-T), rotational coverage (KaRMA-R), and sensitivity to the initial grasp (KaRMA-S). We evaluate KaRMA on 16 widely used robotic hands and compare against static baselines, showing that KaRMA separates hands that rank identically under static proxies, reveals translation-rotation tradeoffs invisible to existing baselines, and is qualitatively consistent with selected published task benchmarks where Jacobian-based metrics can be misleading.
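The breadth-first search over translation and rotation primitives can be illustrated on a toy discretized pose space. The sketch below is not KaRMA itself: poses are reduced to a hypothetical pair (translation index, rotation index), and `toy_feasible` is a made-up stand-in for the joint-limit, collision, rolling-contact, and antipodal-force checks.

```python
from collections import deque

def reachable_poses(start, primitives, is_feasible):
    """Breadth-first search that expands only feasible poses."""
    seen = {start}
    queue = deque([start])
    while queue:
        pose = queue.popleft()
        for dx, dtheta in primitives:
            nxt = (pose[0] + dx, pose[1] + dtheta)
            if nxt not in seen and is_feasible(nxt):
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical primitives: one translation step or one rotation step.
PRIMITIVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def toy_feasible(pose):
    """Toy stand-in for the real feasibility checks."""
    x, theta = pose
    in_limits = abs(x) <= 2 and abs(theta) <= 3
    blocked = x == 1 and theta >= 0  # an arbitrary infeasible strip
    return in_limits and not blocked

poses = reachable_poses((0, 0), PRIMITIVES, toy_feasible)
translations = {x for x, _ in poses}  # distinct reachable translations
rotations = {t for _, t in poses}     # distinct reachable rotations
```

Coverage scores in the spirit of KaRMA-T and KaRMA-R could then be computed from `translations` and `rotations`. The search also shows why reachability differs from static workspace size: poses behind the blocked strip are reached only by rotating first.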

2605.15546 2026-05-18 cs.CV

3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds

Bingwen Qiu, Yuan Liu, Junqi Bai, Tong Jiang, Ben Liang, Fangzhou Chen, Xiubao Sui, Qian Chen

Affiliations: School of Electronic and Optical Engineering; The 28th Research Institute of China Electronics Technology Group Corporation; College of Astronautics; School of Information and Communication Engineering; State Key Laboratory of Extreme Environment Optoelectronic Dynamic Measurement Technology and Instrument

AI summary: This paper proposes 3DTMDet, a network combining SSMs and Transformers to resolve the conflict between sparse distant points and the need for long-range context in point cloud detection; a 3D Hybrid Mamba Transformer block and a voxel generation block improve detection performance.

Abstract

A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to utilize SSM's linear complexity and advantages in long sequence modeling to effectively capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, effectively alleviating the tension between receptive field enlargement and geometric preservation in long-range detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments conducted on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.

2605.15542 2026-05-18 cs.AI

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu Zhou

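Affiliations: Nankai University; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: DRS-GUI improves GUI grounding with a dynamic region search framework: a lightweight UI Perceptor and an MCTS-based Action Planner explore and filter regions efficiently, strengthening the grounding ability of multimodal large language models.

Comments: 11 pages, 8 figures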

Abstract

GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.

2605.15537 2026-05-18 cs.AI

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

Jing Wang, Shang Liu, Hangan Zhou, Zhiyao Xie

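Affiliations: Hong Kong University of Science and Technology

AI summary: This paper proposes the RTL-BenchMT framework, which automatically identifies and corrects flawed cases and detects and updates overfitted cases, addressing defects and overfitting in RTL benchmarks while reducing manual maintenance cost.

Comments: This paper has been accepted by DAC 2026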

2605.15536 2026-05-18 cs.RO cs.AI cs.CV

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; Southern University of Science and Technology; Sun Yat-sen University; UNT; University of Chinese Academy of Sciences

AI summary: SkiP improves robot manipulation efficiency by dynamically skipping redundant steps and refining key steps, with no additional hierarchical structure or learned skip planner.

Abstract

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/.
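The action relabeling rule can be written as a single backward pass over a demonstration. This is a minimal sketch of the rule as the abstract states it, not the authors' code: it assumes the key/skip labels are already given (the abstract obtains them with MSK) and, as a further assumption, leaves trailing skip steps with no later key segment unchanged.

```python
def relabel_actions(actions, is_key):
    """Return behavior-cloning targets after skip-segment relabeling.

    actions[t]: demonstrated action at step t.
    is_key[t]: True if step t lies in a key segment (labels assumed given).
    """
    targets = list(actions)
    next_key_action = None  # action at the entrance of the next key segment
    for t in range(len(actions) - 1, -1, -1):  # scan backwards in time
        if is_key[t]:
            # Walking backwards, the last key step seen before a skip step
            # is the first step (entrance) of the following key segment.
            next_key_action = actions[t]
        elif next_key_action is not None:
            targets[t] = next_key_action
        # Assumption: skip steps after the final key segment keep their action.
    return targets

# Usage on a toy demonstration with one key segment at steps 2-3.
actions = ["a0", "a1", "a2", "a3", "a4", "a5"]
is_key = [False, False, True, True, False, False]
targets = relabel_actions(actions, is_key)
```

Here steps 0 and 1 are both retargeted to `a2`, the action at the entrance of the key segment, so a policy trained on these targets can jump there in one decision.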

2605.15535 2026-05-18 cs.CV

Learning Dynamic Structural Specialization for Underwater Salient Object Detection

Lin Hong, Chenhui Wang, Linan Deng, Yuning Cui, Yu Zhang, Xin Wang, Bojian Zhang, Wenqi Ren, Xingchen Yang, Fumin Zhang

Affiliations: Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; School of Robotics and Advanced Manufacture, Harbin Institute of Technology; School of Computation, Information and Technology, Technical University of Munich; College of Computer Science & Visual Computing and Intelligent Perception Lab, Nankai University; School of Cyber Science and Technology, Sun Yat-sen University; School of Automation, Southeast University

AI summary: This paper proposes DSS-USOD, which uses dynamic structural specialization to address the inaccurate localization, fragmented regions, and coarse boundary predictions caused by underwater image degradation, improving boundary precision and region coherence.

Comments: 15 pages

Abstract

Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.

2605.15533 2026-05-18 cs.CV cs.AI

Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng

Affiliations: JIUTIAN Research, China Mobile; School of Intelligence Science and Technology, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University

AI summary: This paper proposes a tuning-free, instruction-based video editing framework that uses a structural noise initialization strategy and a noise guidance mechanism to improve the visual quality and performance of video editing.

Comments: Accepted by ICIP 2026

Abstract

Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.
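The region-dependent noise assignment behind SNIS can be illustrated as follows. This is a hedged sketch, not the paper's method: the variance-preserving mixing formula, the noise levels `high` and `low`, and the flat-list latent are all assumptions chosen for a self-contained example.

```python
import math
import random

def structural_noise_init(latent, edit_mask, high=0.9, low=0.3, seed=0):
    """Mix each latent value with Gaussian noise at a region-dependent level.

    latent: flat list of floats; edit_mask: same-length list of 0/1 flags.
    high/low: hypothetical noise levels in [0, 1] for edited/unedited regions.
    """
    rng = random.Random(seed)
    out = []
    for x, edited in zip(latent, edit_mask):
        sigma = high if edited else low
        eps = rng.gauss(0.0, 1.0)
        # Assumed variance-preserving mix: more noise, less source signal.
        out.append(math.sqrt(1.0 - sigma ** 2) * x + sigma * eps)
    return out

# Usage: first half of the latent is marked as the edited region.
latent = [1.0] * 2000
mask = [1] * 1000 + [0] * 1000
noisy = structural_noise_init(latent, mask)
```

Edited positions end up dominated by noise, leaving the denoiser free to change content, while unedited positions stay close to the source signal.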

2605.15529 2026-05-18 cs.CL cs.AI cs.LG

Process Rewards with Learned Reliability

Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

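Affiliations: Washington University in St. Louis; Singapore University of Technology and Design

AI summary: This paper proposes BetaPRM, a process reward model that predicts both the step success probability and the reliability of that prediction, so downstream tasks can distinguish reliable rewards from uncertain ones. Its application ACA, used in Best-of-N reasoning, improves the accuracy-token tradeoff.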

Abstract

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
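The Beta-Binomial likelihood mentioned above has a closed form that is easy to write down. The sketch below shows the standard Beta-Binomial negative log-likelihood for k successful continuations out of n under a Beta(alpha, beta) belief. How BetaPRM parameterizes alpha and beta, and how it turns the belief into a reliability score, is not specified in the abstract; reading the belief's variance as an uncertainty signal is an assumption of this sketch.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_nll(k, n, alpha, beta):
    """Negative log-likelihood of k successes in n trials under Beta(alpha, beta)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    log_lik = (log_choose + log_beta(k + alpha, n - k + beta)
               - log_beta(alpha, beta))
    return -log_lik

def belief_summary(alpha, beta):
    """Mean and variance of the Beta belief over step success probability."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance

# Usage: two beliefs with the same mean but different concentration.
loose_mean, loose_var = belief_summary(6.0, 2.0)
tight_mean, tight_var = belief_summary(60.0, 20.0)
```

Both beliefs predict the same success probability, but the second has a smaller variance. A point-target regression to k/n cannot express that difference, which is the motivation the abstract gives for the distributional formulation.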

2605.15528 2026-05-18 cs.RO cs.MA

Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking

Shengchao Zhu, Guangjie Han, Chuan Lin, Yu He

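Affiliations: College of Computer Science and Software Engineering, Hohai University; College of Information Science and Engineering, Hohai University; Software College, Northeastern University

AI summary: This paper builds an open platform that integrates DI-engine with a six-degree-of-freedom underwater AUV simulator for evaluating different MARL algorithms, and proposes the STG-MAPPO algorithm to address the challenges of multi-agent reinforcement learning in underwater target tracking.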

Abstract

Autonomous underwater vehicle (AUV) swarms are emerging as intelligent underwater networks, where each node must sense, communicate, process local data, and make decisions under severe acoustic constraints. Persistent underwater target tracking is a typical task with moving targets, changing communication topology, intermittent acoustic links, and limited observation for each AUV. Multi-agent reinforcement learning (MARL) is a natural candidate for distributed tracking, yet existing studies still lack a unified open-source platform for evaluating different MARL algorithms under six-degree-of-freedom AUV dynamics. In addition, policies trained with raw geometric states and low-level force actions often struggle to represent task phases, observation reliability, link quality, and local cooperation roles. This paper addresses these issues by developing an open-source MARL-AUV platform that integrates DI-engine with a six-degree-of-freedom underwater AUV target-tracking simulator. To the best of our knowledge, it is the first open platform that connects a public MARL training framework with physically modeled AUV swarm-based tasks, and it provides a unified experimental protocol for fair training, testing, and comparison of representative RL and MARL algorithms. Based on this platform, we propose STG-MAPPO, a Semantic Task Graph-enhanced variant of Multi-Agent Proximal Policy Optimization. STG-MAPPO builds semantic policy inputs from tracking diagnostics, task phases, observation confidence, link availability, neighbor tracking quality, and local role advantage. A compact semantic task graph links communication-constrained network states to decentralized actor decisions, and a velocity-level action abstraction maps high-level cooperative decisions to executable six-degree-of-freedom AUV control inputs. The code is available at https://github.com/dasjsaj/MARL-AUV.

2605.15524 2026-05-18 cs.LG cs.AI math.DG math.ST stat.TH

Neural Point-Forms

Bruno Trentini, Jacob Hume, Vincenzo Antonio Isoldi, Philipp Misof, Ekaterina S. Ivshina, Kelly Maggs

Affiliations: NVIDIA; University of Oxford; Max Planck Institute for Mathematics in the Sciences; Department of Mathematical Sciences; Chalmers University of Technology and University of Gothenburg; School of Engineering and Applied Sciences; Max Planck Institute of Molecular Cell Biology and Genetics

AI summary: This paper introduces neural point-forms (NPFs), learnable geometric features for point clouds built with Laplacian-based techniques from diffusion geometry to compare differential forms. Synthetic and biologically relevant experiments show their advantage when labels depend on sampling density, manifold structure, or population geometry.

Abstract

Point cloud learning often rests on the premise that observed samples are noisy traces of an underlying geometric object, such as a manifold embedded in a high-dimensional feature space. Yet much of this geometry is not captured directly by coordinates, pairwise distances, or learned graph neighborhoods alone. In the smooth setting, differential forms are devices to encode higher-order tangency information. In this work, we introduce a new family of principled learnable geometric features for point clouds called neural point-forms (NPFs). In the absence of a natural tangency structure, we instead use Laplacian-based techniques from Diffusion Geometry to build a discrete model for comparing differential forms on point clouds via inner products. In the continuum, submanifolds of a shared ambient feature space are represented as comparison matrices, whose entries describe how pairs of feature forms interact with extrinsic tangency information. We make this intuition precise by proving the long-run consistency of comparison matrices under standard sampling, bandwidth, density, and manifold-hypothesis assumptions. This yields a compact, efficient, and permutation-invariant neural layer whose output is a learned form-comparison matrix. Across synthetic and biologically relevant experiments, we show that NPFs provide a competitive and interpretable representation, with the strongest benefits appearing when labels depend on sampling density, manifold-like structure, or response-relevant population geometry.

2605.15520 2026-05-18 cs.LG cs.AI cs.DC

On the Fragility of Data Attribution When Learning Is Distributed

Xian Gao, Bo Hui, Min-Te Sun, Wei-Shinn Ku

Affiliations: Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama, USA; Department of Computer Science, University of Tulsa, Tulsa, Oklahoma, USA; Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan

AI summary: The study exposes the fragility of data attribution in distributed learning, uses attribution-first attacks to show that attribution values can be artificially inflated, and proposes attribution-robust, incentive-compatible scoring mechanisms.

2605.15519 2026-05-18 cs.CV cs.AI

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik

Affiliations: Washington University in St. Louis; RISE Research Institutes of Sweden; Climate AI Nordics

AI summary: DiffVAS proposes a target-conditioned policy that searches for multiple kinds of targets simultaneously in partially observable environments, making visual active search easier to deploy in real-world applications.

Comments 26 Pages, 12 figures, Accepted to AAMAS 2026

Abstract

Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

2605.15517 2026-05-18 cs.RO cs.SY eess.SY

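Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy

William D. Compton, Zachary Olkin, Aaron D. Ames

Affiliations: Department of Computing and Mathematical Sciences, California Institute of Technology

AI summary: This paper presents a method for training reference-guided, perceptive reinforcement learning policies in which reference trajectories are modulated during training to be consistent with terrain geometry, improving humanoid navigation autonomy.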

Comments 8 pages, 4 figures, intended to submit to Humanoids 2026

Abstract

We present a method for training reference-guided, perceptive reinforcement learning locomotion policies for humanoid robots in which reference trajectories are modulated in training to be consistent with terrain geometry. Aiming to deploy our method with standard navigation autonomy infrastructure, we synthesize SE(2)-controllable reference trajectories inside the RL training loop, projecting desired footsteps onto valid footholds and adjusting swing-foot and center-of-mass trajectories to match the terrain. The resulting policy exposes a clean SE(2) velocity interface compatible with standard navigation planners. In simulation, environmentally-conditioned references significantly improve reference tracking performance compared to environment agnostic references. On hardware, we integrate the policy with an MPC + control barrier function planner and demonstrate long-horizon (>70m) closed-loop autonomous navigation on the Unitree G1 through outdoor environments containing rough terrain and consecutive flights of stairs, with all sensing and computation onboard.

2605.15514 2026-05-18 cs.CL cs.AI cs.LG

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng

Affiliations: University of Illinois at Urbana-Champaign; University of Bonn; Argonne National Laboratory; Amazon AGI

AI summary: This paper proves that RoPE breaks down in long contexts because it loses its locality bias and its consistency in token relevance, so it cannot distinguish positions or tokens. Increasing the RoPE base helps distinguish tokens only at the cost of distinguishing positions.

Comments 35 pages, 11 figures, submitted to NeurIPS 2026

Abstract

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
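
For readers unfamiliar with the mechanism under analysis, the following is a minimal sketch of standard RoPE attention scoring, written with complex numbers so that each pair of dimensions is one complex entry. The function names and the example vectors are assumptions for illustration. The sketch only demonstrates the textbook property that the score depends on the position offset and on the base; it does not reproduce the paper's proofs.

```python
import cmath


def rope_rotate(vec, position, base=10000.0):
    """Rotate complex pair j by the angle position * base ** (-j / len(vec))."""
    d = len(vec)
    return [v * cmath.exp(1j * position * base ** (-j / d)) for j, v in enumerate(vec)]


def attention_score(q, k):
    """Real part of the Hermitian inner product, i.e. the dot product
    of the underlying real-valued query and key vectors."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


# Hypothetical query and key, each with two complex pairs (four real dimensions).
query = [1.0 + 2.0j, 0.5 - 1.0j]
key = [0.3 - 0.7j, -1.2 + 0.4j]

# Same offset (3) at two different absolute positions gives the same score.
score_near_start = attention_score(rope_rotate(query, 5), rope_rotate(key, 2))
score_far_along = attention_score(rope_rotate(query, 105), rope_rotate(key, 102))

# A different offset generally changes the score.
score_other_offset = attention_score(rope_rotate(query, 5), rope_rotate(key, 4))
```

The `base` argument corresponds to the RoPE base hyperparameter discussed in the abstract: a larger base slows the rotation of the higher-index pairs, which is the knob the authors analyze.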

2605.15513 2026-05-18 cs.AI

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

Fangzhou Lin, Shuo Xing, Peiran Li, Siyuan Yang, Qianwen Ge, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu

Affiliations: Texas A&M University; Worcester Polytechnic Institute; Tohoku University; Georgia Institute of Technology

AI summary: CAPS uses cascaded adaptive pairwise selection to cut verifier compute cost while keeping parallel reasoning effective, outperforming the leading pairwise verifier on 14 of 20 suites.

Comments 31 pages, 2 figures, 18 tables

Abstract

Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.

2605.15510 2026-05-18 cs.RO

A QUBO Formulation Framework for Kinematic Structure-Based Robot Design Optimization: A Robotic Hand Case Study

HyoJae Kang, Yeong Jae Park, Jeongdo Ahn, Dongil Park

Affiliations: Advanced Robotics Research Center, Korea Institute of Machinery & Materials (KIMM); Mechanical Engineering, University of Science & Technology (UST)

AI summary: This paper proposes a quadratic unconstrained binary optimization (QUBO) framework for robot design optimization, in which kinematic structure-level metrics are evaluated classically and the resulting selection problem is posed for quantum annealing, and it checks feasibility on a robotic hand case study.

Comments This manuscript has been submitted for possible publication. 14 pages, 5 figures

Abstract

This paper presents a quadratic unconstrained binary optimization-based formulation framework for robot design optimization using kinematic structure-level evaluation metrics. In the proposed framework, classical computation is used to evaluate design-dependent metrics while the resulting combinatorial selection problem is formulated in a structure compatible with quantum annealing-based optimization. A robotic hand is adopted as a representative case study, as its performance is determined by both the individual kinematic characteristics of each finger and interaction terms. The proposed formulation incorporates individual design rewards, overlap workspace interactions, one-hot constraint, and structural dependency penalties into a unified quadratic model. A 27-variable robotic hand design problem is constructed, and simulated annealing is used as a classical baseline to verify the feasibility of the formulation. Quantum annealing is further performed to examine the applicability of the proposed formulation to annealing-based hardware execution. The results show that feasible design combinations satisfying both one-hot selection and pairwise constraints can be obtained, with the observed objective-value range becoming narrower as the number of reads increases. In addition, the formulation process is discussed for other robotic systems. The proposed framework provides a generalized approach for transforming kinematic structure-based robot design problems into combinatorial optimization problems.
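
The way a one-hot constraint and individual design rewards enter a single quadratic model can be sketched as follows. The variable names, reward values, and penalty weight are made-up assumptions, and the brute-force search stands in for an annealer on this toy problem. It is the standard QUBO penalty construction, not the authors' 27-variable model.

```python
from itertools import product


def one_hot_penalty(variables, weight):
    """QUBO coefficients for weight * (sum(x) - 1) ** 2 over binary x,
    with the constant term `weight` dropped. Uses x * x == x for binaries."""
    q = {}
    for i, a in enumerate(variables):
        q[(a, a)] = q.get((a, a), 0.0) - weight          # x_a**2 - 2*x_a == -x_a
        for b in variables[i + 1:]:
            q[(a, b)] = q.get((a, b), 0.0) + 2.0 * weight  # cross term 2*x_a*x_b
    return q


def energy(q, assignment):
    """Evaluate the quadratic objective for a 0/1 assignment."""
    return sum(c * assignment[a] * assignment[b] for (a, b), c in q.items())


def brute_force_minimum(q, variables):
    """Exhaustive search; feasible only for a handful of variables."""
    best = None
    for bits in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        value = energy(q, assignment)
        if best is None or value < best[0]:
            best = (value, assignment)
    return best


# Hypothetical: one finger with three candidate designs and made-up rewards.
rewards = {"finger1_a": 1.0, "finger1_b": 2.5, "finger1_c": 1.5}
names = list(rewards)
qubo = one_hot_penalty(names, weight=5.0)
for name, reward in rewards.items():
    qubo[(name, name)] -= reward  # rewards lower the energy of chosen designs

best_energy, best_assignment = brute_force_minimum(qubo, names)
```

With the penalty weight larger than any reward, selecting zero or several designs costs more than it gains, so the minimum picks exactly one design. Interaction and dependency terms of the kind the abstract lists would be added as further off-diagonal coefficients.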

2605.15509 2026-05-18 cs.LG cs.RO

parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning

Yijun Lu, Zilei Yang, Yuyin Ma

Affiliations: Xinjiang Key Laboratory of Intelligent Computing and Smart Applications; School of Software, Xinjiang University, Urumqi, China; School of Computing Science, Waseda University, Tokyo, Japan

AI summary: ParallelCBF is the first framework to unify tensor-parallel UAV environments, hard-gate CBF safety filters, sharded BC-to-RL pipelines, and first-class operational auditability, exposed as composable APIs for end-to-end safety-constrained training.

Abstract

While Isaac Lab provides massively parallel UAV simulation, OmniSafe and safe-control-gym provide constrained-RL benchmarks, and CBFKit provides control-barrier-function synthesis tooling, no existing framework unifies these capabilities for end-to-end safety-constrained training. ParallelCBF is the first framework to unify (i) tensor-parallel UAV environments, (ii) hard-gate CBF safety filters, (iii) sharded BC-to-RL pipelines, and (iv) first-class operational auditability: pre-registration, watchdog registries, failure forensics, and dataset audits as composable APIs rather than user-implemented scripts. We release ParallelCBF v0.1.0 under Apache 2.0 with a four-layer composable API, a CPU PyTorch reference implementation of a dual-barrier (squared / linear-predictive) CBF, property-based safety invariance tests across vectorized batch sizes that complete in 1.67 s for the full 39-test suite, and a 31,415-episode behavior-cloning collection campaign whose curriculum mix, per-bucket yields, and dataset SHA-256 are auditable through the framework's own `ops` primitives. We report a representative end-to-end pipeline execution in which the framework's auditability layer halted a downstream training stage that did not meet pre-registered convergence criteria, preventing silent propagation of a degraded checkpoint, an architectural property we argue is necessary, not merely useful, for reproducible empirical robotics research. The framework is installable via `pip install parallelcbf`; source and release artifacts are available at https://github.com/xiaoyang-123-cell/ParallelCBF.
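
As background on what a CBF safety filter does, here is a minimal sketch for a one-dimensional single integrator. It does not use the `parallelcbf` package or its API, and it is not the dual-barrier CBF from the abstract. The dynamics, the barrier, and all names are assumptions chosen so the filter has a closed form.

```python
def cbf_filter(x, u_nominal, x_max, alpha=1.0):
    """Filter a nominal command for dx/dt = u with barrier h(x) = x_max - x.

    The CBF condition dh/dt + alpha * h >= 0 reduces to
    u <= alpha * (x_max - x), so the closest safe command is a simple clip.
    """
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)


def rollout(x0, u_nominal, x_max, steps=200, dt=0.05, alpha=1.0):
    """Euler-integrate the filtered system and return the visited states.

    Assumes dt * alpha <= 1 so the discrete update cannot overshoot x_max.
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * cbf_filter(x, u_nominal, x_max, alpha)
        trajectory.append(x)
    return trajectory


# A nominal controller that always pushes toward the limit at speed 2.
states = rollout(x0=0.0, u_nominal=2.0, x_max=1.0)
```

The unfiltered command would cross `x_max` after ten steps. The filtered rollout instead approaches the limit and stays at or below it, which is the invariance property that safety tests of this kind check.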

2605.15504 2026-05-18 cs.LG cs.AI

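Learning with Conflicts of Interest

Nischal Aryal, Arash Termehchy, Ali Vakilian, Marianne Winslett

Affiliations: Oregon State University; Virginia Tech; University of Illinois

AI summary: This paper proposes a game-theoretic framework for conflicts of interest between ML systems and their users, with scalable algorithms that maximize desired information while protecting users from biased and manipulative actions.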

Abstract

Financial, social, and political factors often prevent the interests of the owners of ML systems and services and their users from being perfectly aligned. ML systems often produce biased information that can influence users to make decisions that are not in their best interest. Current solution approaches require ML systems to implement protocols to mitigate their biases. However, ML system owners usually do not have any incentive to implement these protocols and often argue that it limits their freedom of expression or business. We believe that a successful solution to this problem must recognize the conflict of interest between the ML systems and their users, and use this information to protect users against information that adversely influences their decisions while allowing users to safely benefit from these systems. To this end, we propose a game-theoretic framework that models the interaction between ML systems and users with conflicts of interest. We present scalable algorithms with theoretical guarantees that maximize the amount of desired information and actions and minimize the amount of biased and manipulative actions in interaction with ML systems.

2605.15496 2026-05-18 cs.RO cs.CV

LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields

Dongjae Lee, Wooseong Yang, Yifu Tao, Maurice Fallon, Ayoung Kim

Affiliations: Department of Mechanical Engineering, Seoul National University; Oxford Robotics Institute at the University of Oxford

AI summary: LAPS improves replay management for incremental neural mapping through active pooling and active sampling, giving better replay retention and allocation and improving reconstruction completeness while keeping geometric accuracy competitive.

Comments accepted at RA-L 2026

Abstract

Neural distance fields offer a compact and continuous representation of 3D geometry, making them attractive for incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this issue, but existing methods typically rely on passive replay buffers and uniform sampling, which can waste memory on redundant observations and under-train poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves both replay retention and replay allocation during online updates. LAPS combines reliability-based active pooling to retain reliable historical samples under limited memory with uncertainty-guided active sampling to focus optimization on under-constrained regions. Experiments on synthetic and real-world benchmarks show that LAPS consistently improves reconstruction completeness while maintaining competitive geometric accuracy. On Oxford Spires, it improves recall by 4.66 pp and F1-score by 3.79 pp over PIN-SLAM on the Blenheim Palace 05 sequence. We release our open source implementation at: https://github.com/dongjae0107/LAPS.

2605.15492 2026-05-18 cs.RO cs.CV

FLASH: Efficient Visuomotor Policy via Sparse Sampling

Jiaqi Bai, Jindou Jia, Yuxuan Hu, Gen Li, Xiangyu Chen, Tuo An, Kuangji Zuo, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University, Singapore

AI summary: FLASH makes visuomotor policy learning more efficient through sparse sampling and a Legendre polynomial trajectory representation, covering a longer action horizon per inference and running faster. Experiments report state-of-the-art success rates across the evaluated tasks.

Comments 19 pages, 10 figures

Abstract

Generative models such as diffusion and flow matching have become dominant paradigms for visuomotor policy learning, yet their reliance on iterative denoising incurs high inference latency incompatible with real-time robotic control. We present Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, FLASH enables a single inference to cover a significantly extended action horizon. To further accelerate generation, FLASH initiates the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. Moreover, analytic polynomial differentiation directly provides desired velocity feed-forward signals to the torque controller without numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks demonstrate that FLASH achieves state-of-the-art success rates ($\ge 92\%$ across all tasks), a per-episode inference time of $31.40\,ms$ (up to $175\times$ faster than diffusion policies and $18\times$ faster than prior flow matching policies), up to $4\times$ faster training convergence than ACT, and $5\times$ to $7\times$ reduction in controller tracking error compared to discrete-action baselines.
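
The claim that a polynomial trajectory yields velocity feed-forward signals analytically can be illustrated with a short sketch. It evaluates a Legendre series and its exact derivative using standard recurrences, then rescales the derivative to physical time. The function names, the coefficient layout, and the time mapping are assumptions; fitting the coefficients to demonstrations and the flow-matching model are out of scope here.

```python
def legendre_series(coeffs, t):
    """Return (value, derivative) of sum_n coeffs[n] * P_n(t) for t in [-1, 1]."""
    p_prev, p_curr = 1.0, t      # P_0(t), P_1(t)
    d_prev, d_curr = 0.0, 1.0    # P_0'(t), P_1'(t)
    value = coeffs[0] * p_prev
    deriv = 0.0
    if len(coeffs) > 1:
        value += coeffs[1] * p_curr
        deriv += coeffs[1] * d_curr
    for n in range(1, len(coeffs) - 1):
        # Bonnet recurrence: (n + 1) P_{n+1} = (2n + 1) t P_n - n P_{n-1}
        p_next = ((2 * n + 1) * t * p_curr - n * p_prev) / (n + 1)
        # Derivative identity: P_{n+1}' = P_{n-1}' + (2n + 1) P_n
        d_next = d_prev + (2 * n + 1) * p_curr
        value += coeffs[n + 1] * p_next
        deriv += coeffs[n + 1] * d_next
        p_prev, p_curr = p_curr, p_next
        d_prev, d_curr = d_curr, d_next
    return value, deriv


def position_and_velocity(coeffs, tau, horizon):
    """Map physical time tau in [0, horizon] to t in [-1, 1] and apply the chain rule."""
    t = 2.0 * tau / horizon - 1.0
    value, deriv = legendre_series(coeffs, t)
    return value, deriv * 2.0 / horizon


# Hypothetical single-joint trajectory equal to P_2 over a 4-second horizon.
position, velocity = position_and_velocity([0.0, 0.0, 1.0], tau=3.0, horizon=4.0)
```

Because the derivative comes from the same coefficients, the velocity is exact for the represented trajectory and needs no finite differencing between discrete action samples.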

2605.15488 2026-05-18 cs.LG stat.ML

SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference

Shi-ang Qi, Vahid Balazadeh, Michael Cooper, Russell Greiner, Rahul G. Krishnan

Affiliations: Vector Institute; University of Toronto; University of Alberta; Alberta Machine Intelligence Institute

AI summary: SurvivalPFN amortizes survival prediction through in-context learning: a pretrained network handles right-censored data in a single forward pass, avoids restrictive parametric assumptions, produces calibrated survival distributions, and performs strongly across 61 datasets.

Abstract

Survival analysis provides a powerful statistical framework for modeling time-to-event outcomes in the presence of censoring. However, selecting an appropriate estimator from the many specialized survival approaches often requires substantial methodological and domain expertise. We introduce SurvivalPFN, a prior-data fitted network that amortizes Bayesian inference for censored observations through in-context learning. SurvivalPFN is pretrained on a diverse family of synthetic, identifiable, and right-censored data-generating processes, enabling it to amortize survival analysis in a single forward pass during inference. As a result, the model adapts to the effective complexity of each dataset without task-specific training or hyperparameter tuning, avoids restrictive parametric assumptions, and produces calibrated survival distributions. In a large-scale benchmark spanning 61 datasets, 21 methods, and 5 evaluation metrics, SurvivalPFN achieves strong predictive performance and often improves upon established survival models. These results suggest that SurvivalPFN offers a principled and practical foundation model for survival analysis, with potential applications in high-impact domains such as healthcare, finance, and engineering (https://github.com/rgklab/SurvivalPFN).

2605.15486 2026-05-18 cs.RO cs.AI

Hybrid LLM-based Intelligent Framework for Robot Task Scheduling

Swayamjit Saha, Subhabrata Das, Haonan Duan, Xiao-Yang Liu

Affiliations: Department of Computer Science and Engineering, Mississippi State University; Graduate School of Arts and Sciences, Columbia University; Consumer and Community Banking, JPMorgan Chase; Department of Data Science, Columbia University; Department of Electrical Engineering, Columbia University

AI summary: This paper uses large language models to make task scheduling for construction robots more efficient, balancing time efficiency against resource utilization, adding a natural-language interface for real-time communication with professionals, and using two LLM agents to generate more precise task plans.

Comments 9 pages, 5 figures

2605.15484 2026-05-18 cs.CV cs.LG

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

Affiliations: Department of Computer Science and Software Engineering; Auburn University; Department of Information Management; National Central University

AI summary: The paper studies when sparse top-k routing helps in vision classification, identifies a compute-leverage pattern, examines how backbone architecture and multi-expert routing affect performance, and verifies the key factors with controlled experiments.

Comments 24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho

Abstract

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$ (holding architecture, initialization, and $\rho$ fixed) reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.

2605.15480 2026-05-18 cs.RO cs.AI

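Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays

Kaize Deng, Zewen Yang

Affiliations: Technical University of Munich

AI summary: To handle the signal discontinuities caused by stochastic delays, this paper proposes a hybrid control framework that combines an LSTM state estimator with a residual reinforcement learning policy to improve teleoperation stability and performance.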

Comments Accepted at 23rd IFAC World Congress 2026

GiLT: Augmenting Transformer Language Models with Dependency Graphs

RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

NavRL++: A System-Level Framework for Improving Sim-to-Real Transfer in Reinforcement Learning-Based Robot Navigation

When Latent Geometry Is Not Enough: Draft-Conditioned Latent Refinement for Non-Autoregressive Text Generation

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models

KaRMA: A Kinematic Metric for Fine Manipulation Ability in Robotic Hands

3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Learning Dynamic Structural Specialization for Underwater Salient Object Detection

Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

Process Rewards with Learned Reliability

Task-Semantic Graph-Driven Distributed Agent Networking for Underwater Target Tracking

Neural Point-Forms

On the Fragility of Data Attribution When Learning Is Distributed

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

Terrain Consistent Reference-Guided RL for Humanoid Navigation Autonomy

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

A QUBO Formulation Framework for Kinematic Structure-Based Robot Design Optimization: A Robotic Hand Case Study

parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning

Learning with Conflicts of Interest

LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields

FLASH: Efficient Visuomotor Policy via Sparse Sampling

SurvivalPFN: Amortizing Survival Prediction via In-Context Bayesian Inference

Hybrid LLM-based Intelligent Framework for Robot Task Scheduling

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays