arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19703 2026-05-20 cs.RO

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

Dexing Yao, Haochen Li, Junhao Wei, Yifu Zhao, Yanxiao Li, Jiahui Xu, Jinxuan Hu, Lele Tian, Baili Lu, Zikun Li, Xu Yang, Sio-Kei Im, Dingcheng Yang, Yapeng Wang

Affiliations Faculty of Applied Sciences; Macao Polytechnic University; College of Animal Science and Technology; Zhongkai University of Agriculture and Engineering; School of Economics and Management; South China Normal University; Information Engineering School

AI summary This paper proposes KIO-planner, an attention-guided single-stage trajectory planning framework that combines a CBAM module with a Dual Mapping mechanism to achieve low-latency, reliable motion planning in obstacle-dense environments, improving navigation agility and safety.

Comments Accepted by an IEEE Vehicular Technology Conference. 6 pages, 4 figures, 1 table

Abstract

Autonomous UAV flight in confined, wall-dense environments requires low-latency and reliable motion planning under strict safety constraints. Traditional optimization-based planners suffer from mapping latency and easily fall into local minima when navigating through dense structural obstacles. Meanwhile, existing end-to-end learning methods struggle to extract fine-grained geometric features from raw depth images and lack hard kinodynamic constraints, leading to unpredictable collisions near walls. To address these issues, we propose KIO-planner, an attention-guided single-stage trajectory planning framework. First, we integrate a Convolutional Block Attention Module (CBAM) into the perception backbone to adaptively focus on critical structural edges and traversable space. Second, we introduce a novel Dual Mapping mechanism--comprising physical bounds activation and a deterministic Geometric Safety Shield in the depth-pixel space--to enforce kinodynamic feasibility and collision-free flight without global map fusion. Extensive high-fidelity simulated experiments demonstrate that KIO-planner enables highly agile navigation at speeds up to 3.0 m/s. Compared to the state-of-the-art baseline, KIO-planner achieves lower inference latency (approximately 24 ms) and generates significantly smoother trajectories, reducing control cost by 28.4%. Most notably, our Dual Mapping substantially increases the worst-case safety margin, measured by minimum distance to obstacles, from 0.48 m to 0.76 m, ensuring fast, smooth, and safer navigation in highly constrained environments.

2605.19701 2026-05-20 cs.RO

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

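Kyle M. Hart, Brendan Englot

Affiliations Naval Air Warfare Center, Aircraft Division; Department of Mechanical Engineering, Stevens Institute of Technology

AI summary This paper studies how three techniques affect trajectory estimation accuracy in multi-session ground texture SLAM in low-dynamic environments, finds that using Kullback-Leibler divergence as a similarity score and as a bias on loop closure confidence works best, and introduces a dataset of multi-session images with high-accuracy pose information.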

Comments 8 pages, 9 figures. To appear at the 23rd International Conference on Ubiquitous Robots, Osaka, Japan. Distribution Statement A: Approved for public release; distribution is unlimited, as submitted under NAVAIR Public Release Authorization 2025-0098

Abstract

The simultaneous localization and mapping community has introduced a growing number of systems adapted for multi-session operations where the operational environment features low-dynamic changes that impact mapping, such as surface wear, weather phenomena, or seasonal change. These systems allow for lifelong operations by a robot within these environments. There is also growing interest in operations in environments where the unique ground texture is the only mapping feature available for use. These ground texture systems are not yet targeted for multi-session low-dynamic-change environments though. This work explores the impact of three different techniques on trajectory estimation accuracy in these multi-session low-dynamic ground texture environments. Of the three, the use of Kullback-Leibler Divergence, as a similarity score and a bias influencing loop closure confidence, is found to have the most success. We show an analysis of all three methods and a deeper exploration of the impact of Kullback-Leibler Divergence. We also introduce a dataset for use by the robotics community that contains multi-session images where the ground changes between sessions and also high-accuracy pose information for use in evaluation.
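As a rough illustration of the similarity score mentioned in the abstract, the sketch below computes the Kullback-Leibler divergence between two histograms. The histogram inputs, the smoothing constant `eps`, and the function name are assumptions made for illustration; the abstract does not say which distributions the authors compare or how the score biases loop closure confidence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of counts.

    Hypothetical helper: both inputs are normalized to sum to 1 first.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pn = pi / p_sum
        qn = max(qi / q_sum, eps)  # eps avoids log(0) for empty bins in q
        if pn > 0:                 # bins with p = 0 contribute nothing
            total += pn * math.log(pn / qn)
    return total
```

A value of zero means the two histograms are identical; larger values mean the second histogram explains the first less well, so a lower divergence could be read as higher similarity.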

2605.19692 2026-05-20 cs.CV

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

Satoshi Tsutsui, Winnie Pang, Shuting He, Bihan Wen

Affiliations Rapid-Rich Object Search (ROSE) Lab, School of Electrical and Electronic Engineering, Nanyang Technological University; Shanghai University of Finance and Economics

AI summary This paper presents the WBCAtt+ dataset, which densely annotates white blood cell images with 11 morphological attributes and five pixel-level cell components, provides baseline models for attribute recognition and semantic segmentation, and demonstrates applications such as explainable AI models.

Comments Accepted to Medical Image Analysis. arXiv admin note: substantial text overlap with arXiv:2306.13531

Abstract

The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at https://doi.org/10.57967/hf/8143.

2605.19690 2026-05-20 cs.RO

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

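Shintaro Nakaoka, Takayuki Kanai, Kazuhito Tanaka

Affiliations Frontier Research Center, Toyota Motor Corporation

AI summary This paper proposes a fine-tuning method that leverages large-scale pre-training while efficiently learning new setups such as environments or camera configurations, preserving pre-trained knowledge while improving the robustness and accuracy of navigation models.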

Comments This paper has been accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026), which will be held in Vienna, Austria, from June 1 to 5, 2026

Abstract

Navigation Foundation Models (NFMs) trained on large cross-embodied datasets have demonstrated powerful generalizability in various scenarios. Adopting in-domain fine-tuning for an NFM efficiently calibrates the visuomotor policy, promising further improvement even in a novel scenario. However, the fine-tuned models still suffer from poor obstacle avoidance or fail to properly reach the provided goals. Furthermore, model updates using a small subset of data typically erode the pre-trained prior, compromising the pre-training generalization. Consequently, fine-tuning deteriorates the capability of the model for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pre-training while efficiently learning in novel setups, such as environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pre-trained backbone using zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pre-trained knowledge across various behaviors. Despite its simplicity, our comprehensive evaluation of real-world navigation suggests that our proposal effectively enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed method maintains or further improves action prediction capabilities beyond the fine-tuned dataset, providing a key insight into continual learning for general navigation. The project page: https://toyotafrc.github.io/DCLING-Proj/
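The zero-initialized residual pathway described in the abstract can be pictured with a toy dense layer. This is only a sketch under stated assumptions: the two-dimensional weights, the `rgb_feat` and `depth_feat` inputs, and all function names are hypothetical, and the real model is a deep network rather than a single matrix. The point it shows is that, with the projection initialized to zero, the combined output equals the frozen pre-trained output before any fine-tuning step.

```python
def linear(weights, x):
    """Apply a dense layer: `weights` is a list of rows, `x` a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def forward(rgb_feat, depth_feat, frozen_w, branch_w, zero_proj_w):
    base = linear(frozen_w, rgb_feat)        # frozen pre-trained path
    branch = linear(branch_w, depth_feat)    # trainable copy, fed the new conditioning
    residual = linear(zero_proj_w, branch)   # zero-initialized projection
    return [b + r for b, r in zip(base, residual)]

frozen_w = [[0.5, -1.0], [2.0, 0.25]]        # stand-in for pre-trained weights
branch_w = [row[:] for row in frozen_w]      # trainable copy starts from the same weights
zero_proj_w = [[0.0, 0.0], [0.0, 0.0]]       # zero init: the branch adds nothing at first
rgb_feat, depth_feat = [1.0, 2.0], [0.3, 0.7]

out = forward(rgb_feat, depth_feat, frozen_w, branch_w, zero_proj_w)
```

Because the residual starts at zero, early gradient steps change the output gradually, which is one common argument for why this kind of design tends to preserve the pre-trained prior.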

2605.19688 2026-05-20 cs.CV

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

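Kylian Ronfleux-Corail, Guillaume Bernard, Mickaël Coustaty, Nicolas Sidère

Affiliations MAIF, Niort, France; L3i Laboratory, La Rochelle University, La Rochelle, France

AI summary This paper introduces DocQT, a bank of JPEG quantization tables, and compares architectures trained under different quantization-table regimes, showing that standard quality factor augmentation does not represent real-world compression diversity and that architectures that explicitly take the quantization table into account are more robust in deployment.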

Abstract

Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training (restricted to standard libjpeg quality factors) and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness, FFDN [2] and Mesorch [20], each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at https://github.com/Kyliroco/Improving-Document-Forgery-Localization-Robustness-via-Diverse-JPEG-Quantization-Tables. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
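To see why "standard libjpeg quality factors" give a narrow family of tables, the sketch below reproduces the quality scaling rule used by libjpeg (integer arithmetic, baseline clamping to 1..255) applied to the standard luminance table from Annex K of the JPEG standard. The function name is hypothetical, and this is not the DocQT sampling procedure, which the abstract does not detail.

```python
# Standard JPEG luminance quantization table (JPEG standard, Annex K), row-major.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def quality_factor_table(quality, base=BASE_LUMA):
    """Scale a base table by a 1..100 quality factor, libjpeg style."""
    quality = max(1, min(100, quality))
    # Below 50 the table is stretched; at 50 and above it is compressed linearly.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [max(1, min(255, (v * scale + 50) // 100)) for v in base]
```

Every table produced this way is a rescaled copy of one base table, so quality-factor augmentation explores a one-parameter family, whereas tables written by cameras, scanners, or other encoders can differ in shape as well as scale.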

2605.19678 2026-05-20 cs.RO

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

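Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

Affiliations Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; X-Era AI Lab

AI summary This paper proposes RoVLA, a framework that improves the robustness of vision-language-action models through multi-consistency constraints, enforcing consistency under three complementary transformations (instruction semantics, trajectory evolution, and observation perturbation) to improve stability and generalization.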

Abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

2605.19677 2026-05-20 cs.LG q-bio.QM

Agentic Discovery of Cryomicroneedle Formulations

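Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu

Affiliations Department of Biomedical Engineering, City University of Hong Kong

AI summary This study presents a closed-loop workflow combining literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation to discover cryoprotectant formulations for cryomicroneedles, with iterative wet-lab validation improving the model's predictive accuracy.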

Abstract

Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI-assisted, closed-loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation. A curated dataset of 198 mesenchymal stem-cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty-aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet-lab correction. Across ten validation iterations and 106 wet-lab observations, the model progressively adapted to cryomicroneedle-specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later-stage rank correlations became consistently positive, and the cumulative wet-lab predicted-versus-measured summary reached $R^2 = 0.942$. The best validated formulation achieved 95.15% post-thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi-objective optimization. These results demonstrate that agent-assisted computational infrastructure can make data-efficient formulation discovery more accessible to labs with minimal data expertise in-house. Project code is available at https://github.com/baitmeister/ML-for-CryoMN.
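For readers unfamiliar with the two metrics quoted above, the sketch below shows one common way to compute a batch RMSE (in the same units as the viability percentages) and a coefficient of determination. The viability numbers are made up for illustration, the function names are hypothetical, and the paper may define its $R^2$ differently (for example as a squared correlation).

```python
import math

def rmse(measured, predicted):
    """Root-mean-square error, in the units of the inputs (here percentage points)."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Illustrative values only, not data from the paper.
measured = [50.0, 70.0, 90.0]
predicted = [52.0, 68.0, 91.0]
batch_rmse = rmse(measured, predicted)
fit_r2 = r_squared(measured, predicted)
```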

2605.19671 2026-05-20 cs.AI

Transforming Constraint Programs to Input for Local Search

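Jo Devriendt, Patrick De Causmaecker, Marc Denecker

Affiliations University of Leuven

AI summary This paper links symmetry properties of constraint optimization problems to local search neighborhoods, uses that link to automatically generate neighborhoods from a constraint specification for metaheuristic algorithms in the IDP system, and evaluates the generated neighborhoods on six classical optimization problems.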

Comments Unpublished paper accepted and presented at the Fourteenth International Workshop on Constraint Modelling and Reformulation (ModRef) in 2015

Abstract

Applying local search algorithms to combinatorial optimization problems is not an easy feat. Typically, human intervention is required to compile the constraints to input data for some metaheuristic algorithm. In this paper, we establish a link between symmetry properties of constraint optimization problems and local search neighborhoods, and we use this link to automatically generate neighborhoods from a constraint specification in the context of the IDP system. We evaluate the obtained neighborhoods for six classical optimization problems. The resulting observations support the viability of this technique.

2605.19663 2026-05-20 cs.AI

Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

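Weicong Ni, Tianbao Jiang, Linlin Wang

Affiliations East China Normal University

AI summary This paper proposes the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to make vision-language models more reliable and robust on complex tasks, reducing hallucinations and improving reasoning performance.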

Abstract

Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.

2605.19660 2026-05-20 cs.LG cs.CL

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong

Affiliations Tsinghua University; Meituan LongCat Team; The University of Hong Kong; The University of Edinburgh; UCAS; The Hong Kong Polytechnic University

AI summary Targeting the loss of quantization fidelity under extreme KV cache compression in LLMs, this paper proposes the OScaR framework, which uses Canalized Rotation and Omni-Token Scaling to mitigate Token Norm Imbalance, achieving near-lossless INT2 quantization while improving decoding speed and throughput.

Comments Under review

Abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
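The effect the abstract calls Token Norm Imbalance can be reproduced on a toy example: when one token has a much larger norm than the others, per-channel 2-bit parameters shared across tokens are set by the large token, and the small tokens lose most of their information. The sketch below is an assumption-laden illustration with made-up values and hypothetical function names; the per-token normalization shown is a plain stand-in, not OScaR's Canalized Rotation or Omni-Token Scaling.

```python
import math

def quantize_per_channel(rows, bits=2):
    """Min-max quantize each column with parameters shared by all rows (tokens)."""
    levels = (1 << bits) - 1
    out_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        step = (hi - lo) / levels or 1.0   # guard against constant columns
        out_cols.append([lo + round((v - lo) / step) * step for v in col])
    return [list(r) for r in zip(*out_cols)]

def quantize_with_token_scaling(rows, bits=2):
    """Normalize each token to unit norm, quantize per channel, then rescale."""
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rows]
    normalized = [[v / n for v in r] for r, n in zip(rows, norms)]
    dequantized = quantize_per_channel(normalized, bits)
    return [[v * n for v in r] for r, n in zip(dequantized, norms)]

def rel_error(x, y):
    """Relative L2 error of reconstruction y against original x."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return num / math.sqrt(sum(a * a for a in x))

tokens = [
    [100.0, -80.0, 60.0, -40.0],   # one high-norm token
    [1.0, 0.8, -0.6, 0.4],         # two low-norm tokens
    [-0.5, 1.0, 0.7, -0.9],
]
plain = quantize_per_channel(tokens)
scaled = quantize_with_token_scaling(tokens)
```

Comparing `rel_error(tokens[i], plain[i])` with `rel_error(tokens[i], scaled[i])` for the two low-norm tokens shows the shared-parameter error shrinking once norms are equalized, at the cost of storing one extra scale per token.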

2605.19656 2026-05-20 cs.CV

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov

Affiliations Aalto University; Georgia Tech; Niantic Spatial; University of Oulu; ELLIS Institute Finland; UCL

AI summary This paper proposes a feed-forward view synthesis method based on georeferenced images that fuses orthorectified satellite imagery with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame, improving scene coverage and novel-view synthesis.

Comments Submitted to CVPR 2026. 8 figures, 3 tables. Project page: https://nianticspatial.github.io/cross-view-splatter/

Abstract

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

2605.19645 2026-05-20 cs.CL

K-Quantization and its Impact on Output Performance

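Robin Baki Davidsson, Pierre Nugues

Affiliations Lund University

AI summary This paper studies how quantization level (2- to 6-bit) affects the performance and accuracy of large language models (LLMs) on MMLU-Pro, CRUXEval, and MuSR, finding that higher precision (e.g., 8-bit Q8_0) yields better performance while aggressive quantization (e.g., 2-bit Q2_K) costs performance, with the impact varying considerably across models and tasks.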

Comments 13 pages, 4 figures

Abstract

Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between efficiency and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.
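The trade-off between bit width and reconstruction error can be made concrete with a simplified block-wise quantizer. This is a minimal sketch, assuming symmetric absmax scaling with one floating-point scale per block of 32 weights; it is not the Q2_K or Q8_0 storage format, which is more elaborate, and the sine-wave "weights" are stand-in data.

```python
import math

def quantize_block(values, bits):
    """Symmetric absmax quantization of one block to signed `bits`-bit integers."""
    qmax = (1 << (bits - 1)) - 1                      # 1 for 2 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard against all-zero blocks
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(weights, bits, block_size=32):
    """Average absolute reconstruction error after block-wise quantization."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        total += sum(abs(w - q) for w, q in zip(block, quantize_block(block, bits)))
    return total / len(weights)

weights = [math.sin(i) for i in range(64)]            # stand-in weights
errors = {bits: mean_abs_error(weights, bits) for bits in (2, 4, 8)}
```

On this toy data the error grows as the bit width shrinks, which matches the general trend the abstract reports; how much task accuracy that error costs is what the paper measures empirically.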

2605.19634 2026-05-20 cs.CV cs.AI

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

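Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

Affiliations Department of Control Science and Engineering, Tongji University

AI summary This paper proposes the P2DNav framework, which uses panorama-to-downview decomposition, a sliding-window dialogue memory, and a reflective reorientation mechanism to separate directional reasoning from local grounding in zero-shot vision-and-language navigation, and reports strong results on the R2R-CE benchmark.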

Abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.
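One way to picture the Sliding-Window Dialogue Memory is a structure that keeps every dialogue turn as text but only the most recent visual observations. The class below is a hypothetical sketch of that idea; the class name, the `window` parameter, and the dictionary layout are assumptions, since the abstract does not describe the actual data structure.

```python
from collections import deque

class SlidingWindowMemory:
    """Keeps the full text history but only the last `window` observations."""

    def __init__(self, window):
        self.turns = []                             # all dialogue turns, as text
        self.recent_frames = deque(maxlen=window)   # oldest frames drop out automatically

    def add(self, turn_text, frame):
        self.turns.append(turn_text)
        self.recent_frames.append(frame)

    def context(self):
        """Return what would be passed to the model at the next step."""
        return {"dialogue": list(self.turns), "frames": list(self.recent_frames)}

memory = SlidingWindowMemory(window=3)
for step in range(5):
    memory.add(f"step {step}: move forward", f"frame_{step}")
ctx = memory.context()
```

Bounding the number of stored frames keeps the prompt size roughly constant over long episodes, while the text history still records where the agent has been.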

2605.19633 2026-05-20 cs.CL cs.AI cs.LG cs.NE cs.SE

optimize_anything: A Universal API for Optimizing any Text Parameter

optimize_anything: 一个用于优化任何文本参数的通用API

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

发表机构 * MIT(麻省理工学院)

AI总结 本文提出了一种基于LLM的通用优化系统,能够跨不同领域实现文本参数的优化,展示了其在六个多样化任务中的state-of-the-art性能,通过多任务搜索和跨问题迁移实现了高效的优化。

Comments 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

Journal ref Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA

详情
AI中文摘要

单个基于LLM的优化系统能否在根本不同的领域中匹敌专用工具?我们证明,当优化问题被表述为改进一个由评分函数评估的文本工件时,单个基于AI的优化系统(支持单任务搜索、带跨问题迁移的多任务搜索,以及对未见输入的泛化)可在六个不同的任务上取得state-of-the-art的结果。我们的系统发现了将Gemini Flash的ARC-AGI准确率提升至近三倍的代理架构(32.5%到89.5%),找到了将云成本降低40%的调度算法,生成的CUDA内核中有87%达到或超过PyTorch,并优于AlphaEvolve报告的圆填充解(n=26)。在三个领域上的消融研究表明,与仅有分数的反馈相比,可操作的附加信息能带来更快的收敛和显著更高的最终得分;在每个问题预算相同的条件下,多任务搜索通过跨任务迁移优于独立优化,且收益随相关任务数量的增加而增大。综上,我们首次表明,基于LLM搜索的文本优化是一种通用的问题求解范式,可将传统上需要领域特定算法的任务统一到单一框架之下。我们开源了支持多个后端的optimize_anything,作为GEPA项目的一部分,地址为https://github.com/gepa-ai/gepa。

英文摘要

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
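The formulation in the abstract (a text artifact, a scoring function, and a search over candidate texts) can be illustrated with a minimal sketch. Everything here is hypothetical: `improve_text`, `score`, and `propose` are illustrative names rather than the optimize_anything API, and a deterministic mutation stands in for the LLM proposer.

```python
def improve_text(seed, score, propose, budget=50):
    """Greedy search: keep the best-scoring text artifact seen so far."""
    best, best_score = seed, score(seed)
    for _ in range(budget):
        improved = False
        for candidate in propose(best):
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if not improved:
            break  # no proposed edit helped; stop early
    return best, best_score


def score(artifact):
    """Toy scoring function: how close the artifact's 'x=<int>' is to 7."""
    return -abs(int(artifact.split("=")[1]) - 7)


def propose(artifact):
    """Stand-in for an LLM proposer (assumption): nudge the integer by one."""
    value = int(artifact.split("=")[1])
    return [f"x={value - 1}", f"x={value + 1}"]


best, best_score = improve_text("x=0", score, propose)
```

The only contract between the search loop and the problem is "text in, number out", which is what lets one loop serve very different tasks; a real proposer would also consume side information, which this sketch omits.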

2605.19631 2026-05-20 cs.RO cs.CV

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

HEAT: 基于轨迹引导的世界模型实现异构端到端自动驾驶

Hoonhee Cho, Giwon Lee, Jae-Young Kang, Hyemin Yang, Heejun Park, Kuk-Jin Yoon

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出一种基于轨迹引导的学习方法,通过规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示,并结合预测未来潜在特征的世界模型,提高特征一致性并缓解领域偏见,从而在多个异构数据集上实现强性能。

详情
AI中文摘要

端到端自动驾驶作为一种直接将原始传感器数据映射到驾驶动作的替代方案,已逐渐取代传统模块化管道。尽管近期方法在单域数据集上表现强劲,但当在多个异构领域联合训练时,性能显著下降。然而,实际自动驾驶系统必须在具有异构分布的不同环境中运行,包括不同城市、传感器配置和交通模式,而无需领域特定重新训练。这一差距突显了多领域学习中的关键挑战:异构领域中的领域特定变化引入了冲突的学习信号,使模型倾向于妥协解决方案,这些方案在各个领域中都是次优的。为此,我们提出了一种轨迹驱动的学习范式,围绕规划轨迹组织训练,使模型能够捕捉驾驶意图的领域不变表示。此外,我们还引入了一个世界模型,该模型根据自主动作预测未来的潜在特征,从而提高特征一致性和缓解领域引起的偏见。我们在三个基准上评估了我们的方法,即nuScenes、NAVSIM和Waymo端到端数据集,并在所有领域上展示了显著优于现有方法的改进。我们的结果表明,一个统一的模型可以在异构数据集上进行训练,同时在每个领域中保持强大的性能,这表明了向可扩展的现实世界部署迈出的一步。我们将公开我们的代码。

英文摘要

End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

2605.19630 2026-05-20 cs.AI

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

EMO-BOOST:情感增强的音频视觉特征用于深度伪造检测中的泛化改进

Aritra Marik, Marcel Klemt, Anna Rohrbach

发表机构 * Technical University of Darmstadt(达姆施塔特技术大学) ELIZA

AI总结 本文提出EMO-BOOST框架,将现成的、关注RGB与声学特征的深度伪造检测器与基于情感的EmoForensics检测器相融合,利用高层语义线索提升深度伪造检测的泛化能力,实验显示在FakeAVCeleb数据集上平均跨操纵泛化AUC提升了2.1%。

Comments Accepted at SAFE@CVPRW 2026

详情
AI中文摘要

随着生成式AI模型的不断发展,取证学正面临越来越大的压力。新的生成技术不断出现,使得无法为每种操纵收集数据来训练深度伪造检测模型。因此,将模型泛化到训练期间未见过的深度伪造类型是当前深度伪造检测研究中的主要挑战之一。为解决这一挑战,我们采用了高层语义线索,并认为这些线索可以支持低层聚焦方法在泛化到未见操纵类型时发挥作用。在本研究中,我们研究了情感作为高层语义线索。我们提出了EMO-BOOST,一种多模态深度伪造检测框架,该框架融合了传统RGB和声学聚焦深度伪造检测器与我们基于情感的深度伪造检测器EmoForensics。EmoForensics利用视觉和音频情感识别模块,并在音频视频流中建模内在和跨模态的时间一致性。我们发现EmoForensics和低层聚焦方法捕获了互补的信号。因此,在EMO-BOOST中结合这两种信号,使在FakeAVCeleb数据集上的平均跨操纵泛化AUC提高了2.1%。

英文摘要

With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose EMO-BOOST, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in EMO-BOOST enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

2605.19625 2026-05-20 cs.LG

Optimal Reconstruction from Linear Queries

从线性查询中最优重建

Yuval Filmus, Shay Moran, Elizaveta Nesterova

发表机构 * Technion – Israel Institute of Technology(以色列理工学院) Google Research(谷歌研究院)

AI总结 研究如何从近似线性查询中重建未知点,刻画最优重建误差如何随查询数量、维度和噪声参数变化,并引入和分析了重建问题的一个非正常(improper)变体。

Comments Accepted to COLT 2026. 46 pages, 4 figures

详情
AI中文摘要

我们研究从近似线性查询中重建$\mathbb{R}^d$中未知点的问题。该设定自然地出现在从低维遥感和信号恢复到高维数据分析和隐私敏感推断的各类应用中。我们的主要目标是将最优重建误差刻画为查询数量$T$、环境维度$d$和噪声参数$\delta$的函数。我们首先分析$T \to \infty$的极限,证明最优重建误差收敛到显式值$\sqrt{2d/(d+1)} \delta$,其作用类似于监督学习中的贝叶斯最优误差。当维度固定时,我们证明超出该极限的超额误差随$T \to \infty$以双指数速度衰减,比学习曲线中通常遇到的速率快得多。当维度增长时,我们证明数量级为$\exp(d)$的查询对于实现趋于零的超额误差既是必要的也是充分的。最后,我们引入并分析了重建问题的一个非正常(improper)变体。从技术角度看,我们的主要贡献是对Jung定理(1901)的推广。经典定理给出了直径为1的集合的最大可能半径的界,并刻画了极值体。我们的推广提供了一个鲁棒版本,刻画了近极值体,其证明利用了对称性和李群作用的几何与动力学论证。

英文摘要

We study the problem of reconstructing an unknown point in $\mathbb{R}^d$ from approximate linear queries. This setting arises naturally in applications ranging from low-dimensional remote sensing and signal recovery to high-dimensional data analysis and privacy-sensitive inference. Our main goal is to characterize the optimal reconstruction error as a function of the number of queries $T$, the ambient dimension $d$, and the noise parameter $\delta$. We first analyze the limit $T \to \infty$ and show that the optimal reconstruction error converges to the explicit value $\sqrt{2d/(d+1)} \delta$, which plays a role analogous to the Bayes optimal error in supervised learning. When the dimension is fixed, we show that the excess error above this limit decays doubly exponentially fast as $T \to \infty$, a rate that is significantly faster than those typically encountered in learning curves. When the dimension grows, we show that a number of queries on the order of $\exp(d)$ is necessary and sufficient to achieve vanishing excess error. Finally, we introduce and analyze an improper variant of the reconstruction problem. From a technical perspective, our main contribution is a generalization of Jung's theorem (1901). The classical theorem bounds the maximum possible radius of a set of diameter 1 and characterizes extremal bodies. Our generalization provides a robust variant that characterizes near-extremal bodies and is proved via geometric and dynamical arguments exploiting symmetry and Lie group actions.
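The limiting constant is consistent with the classical Jung bound. As a sketch under an assumption the abstract does not state (that the set of points consistent with all answers has diameter at most $2\delta$): Jung's theorem places any set of diameter $D$ in $\mathbb{R}^d$ inside a ball of radius $r \le D\sqrt{d/(2(d+1))}$, and substituting $D = 2\delta$ gives the quoted value.

```latex
\[
r \;\le\; D\sqrt{\frac{d}{2(d+1)}}
\quad\xrightarrow{\;D = 2\delta\;}\quad
r \;\le\; 2\delta\sqrt{\frac{d}{2(d+1)}}
\;=\; \delta\sqrt{\frac{4d}{2(d+1)}}
\;=\; \sqrt{\frac{2d}{d+1}}\,\delta .
\]
```

For $d = 1$ the bound equals $\delta$, and it increases toward $\sqrt{2}\,\delta$ as $d \to \infty$.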

2605.19623 2026-05-20 cs.CV

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

PrAda:基于文本提示的分割的少样本视觉适应

Gabriele Rosi, Fabio Cermelli, Carlo Masone, Barbara Caputo

发表机构 * Politecnico di Torino(都灵理工大学) Focoos AI

AI总结 该研究针对文本提示分割在特定领域中的性能下降问题,提出了一种新的少样本视觉适应方法PrAda,通过结合细粒度像素特征和高层Transformer表示学习类特定原型,从而在不改变模型零样本潜力的情况下实现对新领域的强适应。

Comments CVPR 2026 Findings. Code: https://github.com/FocoosAI/PrAda

详情
AI中文摘要

图像分割对于视觉理解至关重要,但需要大量的像素级标注。基础模型已经使预测新类别的新范式成为可能,这些范式通过文本提示引导,而无需目标领域的标注。然而,在专门化的目标领域中,远离原始预训练,其性能会下降。我们研究了现有方法在这样的领域偏移下的误差,发现误分类而不是掩码生成是主要的罪魁祸首。为了解决这个问题,我们引入了新的问题:基于文本提示的分割的少样本视觉适应。这种适应在图像分类中已被广泛研究,但在分割中仍属未探索的领域。我们通过原型适应(PrAda)解决了这一任务,这是一种新颖且参数高效的适应方法,用于适应冻结的文本提示分割模型。我们的方法通过结合细粒度像素特征和高层Transformer表示来学习类特定原型,然后通过学习的重要性因子将这些原型与原始基于文本的预测融合。这在保持模型零样本潜力的同时,使模型能够适应新领域。在五个基准上的语义、实例和全景分割实验表明,PrAda在与现有最先进方法和所提基线相比时,取得了显著的改进。

英文摘要

Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model's zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

2605.19622 2026-05-20 cs.CV

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

UniRefiner: 通过对比寄存器教会预训练ViT自行清除杂质

Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang

发表机构 * Xi’an Jiaotong University, School of Software Engineering(西安交通大学软件工程学院) University of Chinese Academy of Sciences(中国科学院大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本文提出UniRefiner,一种通用的精炼(refinement)框架,通过对比寄存器(contrastive register)教会预训练ViT自行清除未能编码位置对齐语义的虚假token,提升模型在密集预测等空间敏感任务中的表现。

Comments CVPR 2026

详情
AI中文摘要

基于 Vision Transformers (ViTs) 的表示学习已取得显著进展,然而大规模模型在空间敏感任务中的实用性受到虚假 token 的阻碍。先前的缓解措施有限,通常将这些伪影狭义地定义为简单的高范数异常值。我们认为这种范围不足。对于密集预测任务,我们提出任何未能编码位置对齐语义的 token 应被视为伪影。这种更广义的定义揭示了一个更复杂的问题,促使我们系统地分类并表征三种基本类型的伪影 token,这些 token 污染了空间表示。基于这种全面的诊断,我们提出了 UniRefiner,一种通用的 refinement 框架,教会预训练 ViTs 自我处理这些伪影。UniRefiner 使用对比注册来显式隔离并重新分配伪影 token,通过双重目标:(i) 它将图像 token 与过滤后的正常 token 对齐以保持语义,(ii) 它将注册 token 与检测到的伪影 token 对齐以捕捉伪影信号。我们的方法仅需在 ~5k 图像上进行少量微调即可优化多种 ViTs,包括 EVA-CLIP-8B 和 InternViT-6B 等大规模模型。实验显示了一致且显著的改进:特别是优化后的 EVA-CLIP-8B 在 ADE20K 上达到 51.9% mIoU(+9.4%),超过 DINOv2(49.1%)等专用视觉模型,零样本分割精度提升高达 22%。UniRefiner 解锁了现有大规模基础模型的潜在空间能力,为它们的广泛应用铺平了道路。

英文摘要

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

2605.19620 2026-05-20 cs.CV

Bézier Degradation Modeling for LiDAR-based Human Motion Capture

基于LiDAR的人体动作捕捉的贝塞尔退化建模

Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang

发表机构 * PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院PCA实验室) PCA Lab, School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院PCA实验室)

AI总结 本文提出BMLiCap框架,通过时间可压缩的贝塞尔曲线建模人体动作,采用轨迹保留策略减少控制点,设计渐进式动作重建模块,利用时间尺度运动变换器和多级动作聚合器有效融合多尺度曲线,以提高复杂场景下的动作重建精度和时间连续性。

Comments Accepted by CVPR 2026

详情
AI中文摘要

基于LiDAR的3D人体动作捕捉在自动驾驶和机器人领域有广泛应用,准确的动作重建至关重要。然而,现有方法在不稳定输入和严重遮挡情况下常常导致预测抖动甚至失败。为了解决这些挑战,我们提出BMLiCap,一种从粗到细的框架,通过时间可压缩的贝塞尔曲线建模运动。通过采用轨迹保留策略减少控制点,我们获得了一种连贯且易于学习的动作表示。为了从LiDAR点云线索中重建人体动作,我们设计了一个渐进式动作重建模块。具体来说,引入了时间尺度运动变换器(TMT)来在多个时间尺度上预测运动曲线,并利用多级动作聚合器(MMA)来适应性融合多尺度曲线,以恢复详细的、时间连贯的姿态,有效弥补由遮挡和噪声引起的观测缺口。在四个主流基准LiDARHuman26M、FreeMotion、NoiseMotion和SLOPER4D上,BMLiCap在复杂场景中实现了最先进的准确性和时间连续性,证明了其在严重遮挡下的补偿能力和减少预测抖动的能力。

英文摘要

LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks (LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D), BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.
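To make "temporally compressible Bézier curves" concrete, here is a minimal sketch of decoding a dense, smooth joint trajectory from a handful of control points using de Casteljau's algorithm. The control points, the joint, and the frame count are made-up illustration values, and the abstract does not say how BMLiCap itself parameterizes or predicts its curves.

```python
from typing import List, Sequence

Point = Sequence[float]


def bezier_point(control_points: List[Point], t: float) -> List[float]:
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        # Repeatedly interpolate between neighbouring points until one remains.
        pts = [
            [(1.0 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(pts[:-1], pts[1:])
        ]
    return pts[0]


def sample_curve(control_points: List[Point], num_frames: int) -> List[List[float]]:
    """Decode num_frames (>= 2) evenly spaced positions from the control points."""
    return [bezier_point(control_points, i / (num_frames - 1)) for i in range(num_frames)]


# Hypothetical 3D trajectory of one joint, stored as only 4 control points.
wrist_ctrl = [[0.0, 0.0, 1.0], [0.1, 0.3, 1.2], [0.4, 0.3, 1.2], [0.5, 0.0, 1.0]]
frames = sample_curve(wrist_ctrl, num_frames=30)
```

Thirty frames are recovered from four stored points, and because every frame comes from the same polynomial, the decoded motion is smooth by construction, which is the property that counters frame-to-frame jitter.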

2605.19619 2026-05-20 cs.LG cs.AI math.OC stat.ML

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

MiMuon: 面向大模型、具有更好泛化能力的混合Muon优化器

Feihu Huang, Yuning Luo, Songcan Chen

发表机构 * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) MIIT Key Laboratory of Pattern Analysis and Machine Intelligence(工业和信息化部模式分析与机器智能重点实验室) College of Design and Engineering, National University of Singapore(新加坡国立大学设计与工程学院)

AI总结 本文研究了Muon优化器的泛化误差,提出了一种改进的混合Muon优化器MiMuon,证明其泛化误差更低,同时保持了与Muon优化器相同的收敛速度。

Comments 25 pages

详情
AI中文摘要

矩阵结构的参数在许多人工智能模型(如大语言模型)中频繁出现。最近,针对大规模模型的矩阵参数设计了一种高效的Muon优化器,其收敛速度明显快于逐向量(vector-wise)算法。尽管已有一些工作开始研究Muon优化器的收敛性质(即优化误差),但其泛化性质(即泛化误差)尚未建立。因此,在本文中,我们基于算法稳定性与数学归纳法研究Muon优化器的泛化误差,并证明Muon优化器的泛化误差为O(1/(Nκ^T)),其中N为训练样本数量,T表示迭代次数,κ>0表示梯度估计的奇异值之间的最小差。为了增强Muon优化器的泛化能力,我们通过谨慎地使用梯度正交化,提出了一种有效的混合Muon(MiMuon)优化器,它是Muon优化器与基于动量的SGD优化器的混合。随后我们证明,MiMuon优化器的泛化误差为O(1/N),低于Muon优化器的O(1/(Nκ^T)),因为κ通常非常小。同时,我们还研究了MiMuon算法的收敛性质,并证明其具有与Muon算法相同的收敛速度O(1/T^{1/4})。在训练大模型(包括Qwen3-0.6B和YOLO26m)上的数值实验结果展示了MiMuon优化器的效率。

英文摘要

Matrix-structured parameters appear frequently in many artificial intelligence models, such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. In this paper, we therefore study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ denotes the number of iterations, and $\kappa>0$ denotes the minimum difference between singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer that uses orthogonalization of the gradient cautiously and is a hybrid of the Muon and momentum-based SGD optimizers. We then prove that our MiMuon optimizer has a generalization error of $O\big(\frac{1}{N}\big)$, lower than the $O\big(\frac{1}{N\kappa^{T}}\big)$ of the Muon optimizer, since $\kappa$ is generally very small. Meanwhile, we also study the convergence properties of our MiMuon algorithm, and prove that it has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
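The "orthogonalization of gradient" that distinguishes Muon-style updates from momentum SGD can be sketched on a tiny matrix. This is an illustration under stated assumptions: it uses the plain cubic Newton-Schulz iteration instead of tuned coefficients, pure-Python lists instead of tensors, and hypothetical function names. How MiMuon decides when to orthogonalize is not described in the abstract and is not modeled here.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(m, steps=20):
    """Approximate the polar factor U V^T of a nonzero matrix m (Newton-Schulz)."""
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]  # singular values are now at most 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, xxt_x)]
    return x


def muon_style_step(weight, momentum, grad, lr=0.1, beta=0.9):
    """One update: accumulate momentum, then step along its orthogonalized version."""
    momentum = [[beta * mv + gv for mv, gv in zip(mr, gr)] for mr, gr in zip(momentum, grad)]
    direction = orthogonalize(momentum)
    weight = [[wv - lr * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(weight, direction)]
    return weight, momentum


grad = [[3.0, 1.0], [1.0, 2.0]]  # toy gradient with unequal singular values
zero = [[0.0, 0.0], [0.0, 0.0]]
new_weight, new_momentum = muon_style_step(zero, zero, grad)
```

Orthogonalization sets every singular value of the update direction to 1, so the step size no longer depends on the gradient's spectrum. A momentum SGD step would instead move along `momentum` directly.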

2605.19618 2026-05-20 cs.LG stat.ME

A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees

用于评估可解释集成树重建质量的一族散度度量

Massimo Aria, Agostino Gnasso, Carmela Iorio

发表机构 * Department of Economics and Statistics, University of Naples Federico II(那不勒斯费德里科二世大学经济与统计系)

AI总结 本文提出了一种基于散度的度量框架,用于评估可解释集成树的重建质量,通过区分一致性(agreement)与关联性(association),提供了一种新的诊断方法来定位重建失败的位置和原因。

详情
AI中文摘要

验证集成学习者可解释的替代模型需要测量集成内部表示与其替代近似之间的同意程度,而不是仅仅关联性。基于相关性的方法是尺度不变的,无法检测共现结构中的系统性差异。我们提出了一种基于一致性和关联性区别的统计框架,以归一化的可解释性损失(nLoI)为中心。该框架基于Cressie-Read幂发散家族,lambda等于2,nLoI可以分解为节点内和节点间的组成部分,提供了独特的诊断能力,以精确识别重建失败的位置和原因。该框架包含四个互补的度量,捕捉替代质量的不同结构方面。统一的排列检验程序在单次重采样过程中为所有度量提供有效的推断。每个度量的理论性质,包括有界性和对称性,均已建立。蒙特卡洛模拟和实证评估证实了精确的I型错误控制,并展示了这些度量能够检测出相关性方法无法检测到的重建保真度梯度。该框架在可解释性集成树(E2Tree)的背景下开发和说明,并在三个基准数据集上的实证评估展示了该框架的实际应用价值。

英文摘要

Validating interpretable surrogate models for ensemble learners requires measuring agreement between the ensemble's internal representation and its surrogate approximation, rather than mere association. Correlation-based approaches are scale-invariant and fail to detect systematic discrepancies in co-occurrence structure. We propose a statistical framework grounded in the agreement-association distinction, centered on the normalized Loss of Interpretability (nLoI). Rooted in the Cressie-Read power divergence family with lambda equal to 2, the nLoI admits a closed-form decomposition into within-node and between-node components, providing a unique diagnostic capability to identify precisely where and why reconstruction fails. The framework incorporates four complementary measures capturing distinct structural facets of approximation quality. A unified permutation testing procedure delivers valid inference for all measures within a single resampling pass. Theoretical properties, including boundedness and symmetry, are established for each metric. Monte Carlo simulations and empirical evaluations confirm exact Type I error control and demonstrate that these measures detect reconstruction fidelity gradients invisible to correlation-based alternatives. The framework is developed and illustrated in the context of Explainable Ensemble Trees (E2Tree), and empirical evaluation on three benchmark datasets illustrates the practical utility of the framework.
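For readers unfamiliar with the Cressie-Read family, the following sketch computes the power-divergence statistic for a given lambda and checks the well-known special case that lambda = 1 recovers Pearson's chi-square when observed and expected totals match. The counts are made up, and the abstract does not specify how the nLoI is normalized or built from co-occurrence structure, so none of that is modeled here.

```python
def power_divergence(observed, expected, lam):
    """Cressie-Read statistic: 2/(lam*(lam+1)) * sum O_i * ((O_i/E_i)**lam - 1)."""
    if lam in (0, -1):
        raise ValueError("lam = 0 and lam = -1 are defined as limits; not handled here")
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))
    return 2.0 / (lam * (lam + 1.0)) * total


def pearson_chi2(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative counts only; both lists sum to 100.
observed = [18.0, 22.0, 30.0, 30.0]
expected = [25.0, 25.0, 25.0, 25.0]
cr_lambda2 = power_divergence(observed, expected, lam=2)
```

Unlike a correlation coefficient, the statistic changes when the observed counts are rescaled relative to the expected ones, which is the agreement-versus-association distinction the abstract draws.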

2605.19613 2026-05-20 cs.CV

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

先白平衡,后调整:通过视觉-语言评估实现跨相机颜色恒常性

Shuwei Li, Lei Tan, Robby T. Tan

发表机构 * National University of Singapore(新加坡国立大学) ASUS Intelligent Cloud Services(华硕智能云服务)

AI总结 本文提出VLM-CC框架,通过视觉-语言模型评估实现跨相机颜色恒常性的迭代反馈优化,利用感知反馈替代直接RGB回归,提升鲁棒性。

Comments In CVPR 2026

详情
AI中文摘要

颜色恒常性旨在保持物体颜色在不同光照下的一致性。跨相机颜色恒常性仍具挑战性,因为基于学习的模型常过拟合训练相机的颜色响应特性,导致在其他相机拍摄的图像上性能下降。我们提出VLM-CC,一种反馈引导的框架,将颜色恒常性建模为迭代细化过程。而不是直接从原始输入估计光源,VLM-CC通过视觉-语言模型(VLM)基于的评估进行迭代修正。在每次迭代中,图像使用当前估计进行白平衡并转换为伪sRGB。一个轻量级的LoRA微调VLM然后评估校正后的图像,识别主导的残差色偏并提供定性反馈。此反馈被映射到残差照明方向(红、绿或蓝)并用于更新光源估计,直到收敛。我们的关键思想是将颜色恒常性重新建模为迭代感知反馈问题,利用VLM评估而不是直接RGB回归。通过将直接RGB估计替换为VLM引导的感知反馈,VLM-CC在多个数据集上实现了跨相机颜色恒常性的最先进鲁棒性。代码将在https://github.com/NothingIknow/VLM-CC上提供。

英文摘要

Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at https://github.com/NothingIknow/VLM-CC.
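The feedback loop described above (white-balance with the current estimate, ask an evaluator for the dominant residual cast, nudge the estimate in that direction, repeat) can be sketched as follows. Here `judge_cast` is a simple channel-mean heuristic standing in for the LoRA-tuned VLM, the pseudo-sRGB conversion is omitted, and the step size and tolerance are arbitrary assumptions.

```python
def white_balance(pixels, illuminant):
    """Divide each RGB channel by the current illuminant estimate."""
    return [tuple(c / e for c, e in zip(px, illuminant)) for px in pixels]


def judge_cast(pixels, tolerance=1.10):
    """Stand-in for the VLM evaluator: name the dominant residual cast, or 'neutral'."""
    means = [sum(px[ch] for px in pixels) / len(pixels) for ch in range(3)]
    if max(means) / min(means) < tolerance:
        return "neutral"
    return ("red", "green", "blue")[means.index(max(means))]


def refine_illuminant(pixels, step=1.05, max_iters=100):
    """Iteratively update the illuminant estimate from qualitative feedback only."""
    estimate = [1.0, 1.0, 1.0]
    channel_index = {"red": 0, "green": 1, "blue": 2}
    for _ in range(max_iters):
        feedback = judge_cast(white_balance(pixels, estimate))
        if feedback == "neutral":
            break
        estimate[channel_index[feedback]] *= step  # push the estimate toward the cast
    return estimate


# Toy scene: gray patches of varying brightness lit by a warm illuminant.
true_illuminant = (1.4, 1.0, 0.7)
scene = [tuple(level * c for c in true_illuminant) for level in (0.2, 0.5, 0.8)]
estimate = refine_illuminant(scene)
corrected = white_balance(scene, estimate)
```

The tolerance is deliberately larger than the step so that one update cannot overshoot past the neutral band and oscillate. The evaluator never reports RGB values, only a direction, which is the sense in which the problem is reframed as perceptual feedback.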

2605.19607 2026-05-20 cs.CV cs.AI cs.LG

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

基于谱积分梯度的粗到细特征归因

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) INEEJI Corp.(INEEJI公司)

AI总结 本文提出Spectral Integrated Gradients(SIG)方法,通过奇异值分解构建积分路径,以减少噪声并提高特征归因的准确性,优于传统路径基方法。

Comments 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

详情
AI中文摘要

积分梯度(IG)是一种被广泛采用的特征归因方法,满足理想的公理化性质。然而,积分路径的选择显著影响归因质量;标准的直线路径同时引入所有输入特征,往往沿途累积噪声梯度。为解决这一局限,我们提出了Spectral Integrated Gradients(SIG),它基于基线到输入差值的奇异值分解(SVD)构建积分路径。通过按奇异值从大到小逐步激活奇异成分,SIG先引入全局结构,再引入细粒度细节,自然地遵循由粗到细的过程。通过在多种图像分类数据集上的广泛评估,我们证明SIG生成的归因图更干净、噪声更少,并且在定量性能上优于现有的基于路径的归因方法。我们的代码可在https://github.com/leekwoon/sig/获取。

英文摘要

Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at https://github.com/leekwoon/sig/.
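A minimal sketch of the path construction described above: the baseline-to-input difference is split into singular components, and gradients are integrated along a piecewise-linear path that adds one component at a time, largest singular value first. The toy model, its weights, and the hard-coded components are assumptions for illustration (a real implementation would obtain the components from an SVD routine), and this is not the authors' released code.

```python
def model(x, w=(1.0, 2.0, 0.5, 1.5)):
    """Toy differentiable score on a flattened 2x2 input: f(x) = (w . x)^2."""
    return sum(wi * xi for wi, xi in zip(w, x)) ** 2


def model_grad(x, w=(1.0, 2.0, 0.5, 1.5)):
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return [2.0 * dot * wi for wi in w]


def spectral_path_attributions(baseline, components, steps=50):
    """Integrate gradients along a path that adds one singular component at a time."""
    attributions = [0.0] * len(baseline)
    start = list(baseline)
    for comp in components:              # ordered from largest singular value
        for k in range(steps):
            alpha = (k + 0.5) / steps    # midpoint rule on this segment
            point = [s + alpha * c for s, c in zip(start, comp)]
            grad = model_grad(point)
            for i, c in enumerate(comp):
                attributions[i] += grad[i] * c / steps
        start = [s + c for s, c in zip(start, comp)]
    return attributions


# Flattened 2x2 difference (input - baseline) = 3*u1*v1^T + 1*u2*v2^T, with
# u1=(0.6, 0.8), v1=(1, 0), u2=(-0.8, 0.6), v2=(0, 1) hard-coded for brevity.
baseline = [0.0, 0.0, 0.0, 0.0]
components = [[1.8, 0.0, 2.4, 0.0], [0.0, -0.8, 0.0, 0.6]]
attributions = spectral_path_attributions(baseline, components)
input_x = [b + sum(c[i] for c in components) for i, b in enumerate(baseline)]
```

Because the path still starts at the baseline and ends at the input, the attributions sum to `model(input_x) - model(baseline)`, so the completeness axiom of IG is preserved while the order in which structure is introduced changes.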

2605.19605 2026-05-20 cs.CV

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

deadtrees.earth-aerial: 用于树木覆盖与树木死亡检测的多分辨率航拍图像数据集

Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn

发表机构 * Chair of Sensor-based Geoinformatics, University of Freiburg(弗赖堡大学基于传感器的地理信息学教席) EcoVision Lab, DM3L, University of Zurich(苏黎世大学EcoVision实验室) Institute for Earth System Science and Remote Sensing, Leipzig University(莱比锡大学地球系统科学与遥感研究所) Washington University, St. Louis(圣路易斯华盛顿大学)

AI总结 本文提出两个全新的开放数据集,用于从厘米级航拍图像中进行树冠和死亡的联合分割,解决了全球范围内缺乏统一数据集的问题,并在多个生物群落中实现了显著的性能提升。

Comments Preprint. Under review. All rights reserved

详情
AI中文摘要

全球范围内的森林正日益受到气候变化和火灾、害虫和病原体等破坏的威胁,这催生了对大规模树冠和树死亡监测的迫切需求。无人机和飞机的航拍图像是一种关键的数据源,用于详细且大规模地绘制树冠和死亡情况。然而,相关进展受限于缺乏全球代表性、统一的数据集,用于树冠和死亡的联合分割。我们介绍了两个新的、开放的、适合机器学习的数据集,首次在全球范围内实现了从厘米级航拍图像中进行树冠和死亡的联合分割。通过DTE-aerial-train,我们提供了一个包含385,000个1024x1024像素图像块的训练数据集,分辨率范围从2.5到20厘米。它包括多类专家标注和审核的伪标签,用于树冠和死亡。通过DTE-aerial-bench,我们提供了一个地理上平衡的基准测试集,包含25个全球分布的正射图像,总计525个高质量的专家标注图像块,用于树冠和死亡。训练和基准数据集涵盖了热带、温带、寒带和干旱生物群落,并覆盖了广泛的森林结构和死亡模式。使用基准测试集进行评估,我们建立了强参考基线,这些基线在所有生物群落和尺度上提高了死亡分割的性能,在挑战性区域如寒带森林中,F1分数从0.40提高到0.58,提升了约45%的相对性能。所有数据、模型和代码将在宽松的开源许可证下公开发布。基准数据集的交互式可视化可在deadtrees.earth/releases/dte-aerial-bench查看。

英文摘要

Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at deadtrees.earth/releases/dte-aerial-bench.
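The quoted relative improvement for boreal forests follows directly from the two F1 scores in the abstract:

```latex
\[
\frac{0.58 - 0.40}{0.40} \;=\; \frac{0.18}{0.40} \;=\; 0.45 \;=\; 45\%.
\]
```

The absolute gain is 0.18 F1 points; the 45% figure is relative to the 0.40 starting score.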

2605.19604 2026-05-20 cs.AI

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

形式技能:用于高效且准确LLM代理的可编程运行时技能

Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, Tong Yang

发表机构 * FairyClaw

AI总结 本文提出形式技能,一种用于LLM代理的可编程运行时技能抽象,通过JSON元数据和动作模式、可靠的Python执行器、受钩子控制的控制逻辑、形式技能路由和本地运行时状态,提高代理的效率和准确性。

详情
AI中文摘要

大型语言模型(LLM)代理越来越多地在真实工作空间中发挥作用,其中工具和技能决定了模型推理是否能够可靠地转化为行动。现有的技能仍然主要非正式:Markdown技能和指令包将过程编码为长自然语言文档,而函数调用、模型上下文协议(MCP)服务器和框架工具则结构化单个动作,但通常将工作流状态、政策执行和完成纪律排除在技能本身之外。我们引入了形式技能,一种运行时原生的抽象,它通过JSON元数据和动作模式、可靠的Python执行器、受钩子控制的控制逻辑、形式技能路由和本地运行时状态来表示可重用的能力。通过将可重用的过程从重复的提示文本中转移到可执行的状态机和钩子策略中,形式技能为代理提供了一个令牌高效且可执行的控制面。我们在FairyClaw中实现了该抽象,这是一个开源的事件驱动运行时,用于可执行、可观察和可组合的形式技能。在Harness-Bench上,FairyClaw获得了高度竞争的平均分数,同时使用显著更少的令牌,尤其在暴露形式技能作用的任务上表现尤为突出。

英文摘要

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.
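The ingredients listed in the abstract (JSON metadata with action schemas, Python executors, hook-governed control, and skill-local state) can be pictured with a small sketch. All names below (`SkillRuntime`, `forbid_hidden_targets`, the metadata layout) are hypothetical and are not taken from FairyClaw. The sketch only shows how moving a policy out of prompt text and into a hook makes it enforceable.

```python
import json

SKILL_METADATA = json.loads("""
{
  "name": "rename_file",
  "actions": {
    "rename": {"required": ["src", "dst"]}
  }
}
""")


class SkillRuntime:
    """Hypothetical runtime: validates actions, runs hooks, keeps skill-local state."""

    def __init__(self, metadata, executors, pre_hooks=()):
        self.metadata = metadata
        self.executors = executors
        self.pre_hooks = list(pre_hooks)
        self.state = {"calls": 0}  # skill-local runtime state

    def call(self, action, **args):
        schema = self.metadata["actions"][action]
        missing = [k for k in schema["required"] if k not in args]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        for hook in self.pre_hooks:  # a hook may veto the action by raising
            hook(action, args, self.state)
        self.state["calls"] += 1
        return self.executors[action](**args)


def forbid_hidden_targets(action, args, state):
    """Policy hook: reject destinations that start with a dot."""
    if args["dst"].startswith("."):
        raise PermissionError("policy hook: hidden destination not allowed")


def rename(src, dst):
    return f"renamed {src} -> {dst}"  # a real executor would touch the filesystem


runtime = SkillRuntime(SKILL_METADATA, {"rename": rename}, [forbid_hidden_targets])
result = runtime.call("rename", src="a.txt", dst="b.txt")
```

The model only has to emit an action name and arguments. The schema check and the hook run on every call without spending prompt tokens on restating the rule.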

2605.19600 2026-05-20 cs.RO

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

FlyMirage: 基于生成式世界模型的多样化、可扩展无人机飞行数据全自动生成流程

Jinhan Li, Xijie Huang, Zhaoqi Wang, Yijin Wang, Weiqi Ge, Qiyi He, Mo Zhu, Fei Gao, Yuze Wu, Xin Zhou

发表机构 * State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China(浙江大学工业控制技术国家重点实验室,杭州310027,中国) Differential Robotics, Hangzhou 311121, China(差分机器人,杭州311121,中国)

AI总结 本文提出FlyMirage,一种完全自动化的生成流程,通过生成世界模型生成大规模、多样化且逼真的无人机视觉-语言导航数据,支持下一代具身导航模型的发展。

详情
AI中文摘要

在视觉-语言导航(VLN)领域,空中数据集在结合规模、多样性和现实感方面仍然有限,通常依赖于昂贵的真实世界场景或视觉受限的模拟。为了解决这些挑战,我们引入了FlyMirage,一种高度可扩展且完全自动化的空中VLN数据生成流程。我们的方法利用大型语言模型(LLM)作为环境设计师来促进场景多样性,配以生成世界模型,将这些设计转化为高保真的3D高斯点云(3DGS)场景。为了显著减少人工劳动并确保飞行数据的可行性,FlyMirage自动化了场景探索和语义信息获取,并进一步集成了动态可行的规划器用于无人机(UAV)轨迹生成。利用这一工具链,我们生成了一个大规模、多样化且逼真的空中VLN数据集,具有动态可行的飞行轨迹,旨在支持下一代具身导航模型的发展。

英文摘要

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages a large language model (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

2605.19597 2026-05-20 cs.CL

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

Affiliations * Institute of Trustworthy Embodied Artificial Intelligence; Fudan University; Hunyuan Team, Tencent; School of Philosophy, Fudan University

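AI summary: This paper presents LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark comprises a 246-item Base subset and a 190-item Hard subset, and an evaluation of 14 frontier LLMs reveals substantial gaps in current models.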

Details
English abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.

2605.19595 2026-05-20 cs.CV cs.AI

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

Affiliations * Department of Automation and Systems Engineering, Federal University of Santa Catarina, Florianópolis, Brazil; Applications Lab, Faculty of Science, University of Salamanca, Plaza de los Caídos s/n, 37008 Salamanca, Spain

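AI summary: This paper proposes an optimized YOLO26-MoE model that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector, adapting to subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization is carried out through an LLM agent. On UAV images the model achieves 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions.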

Details
English abstract

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

2605.19594 2026-05-20 cs.RO

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

Jingyu Li, Zhe Liu, Wenxiao Wu, Li Zhang

Affiliations * Fudan University; Shanghai Innovation Institute; University of Hong Kong; Huazhong University of Science and Technology

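AI summary: This paper presents MCNav, a memory-aware navigation framework with a dynamic cognitive map. By efficiently querying information about relevant objects in explored areas, it addresses targets that are missed or misidentified in zero-shot goal-oriented navigation. Its goal re-validation and missed goal re-exploration strategies, combined with blacklist and double-check mechanisms, achieve state-of-the-art performance.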

Details
English abstract

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.