2605.19678 2026-05-20 cs.RO

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing; X-Era AI Lab

AI summary: This paper proposes RoVLA, a framework that improves the robustness of vision-language-action models through multi-consistency constraints, enforcing consistency under three complementary transformations (instruction semantics, trajectory evolution, and observation perturbation) to strengthen stability and generalization.

Abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

2605.19677 2026-05-20 cs.LG q-bio.QM

Agentic Discovery of Cryomicroneedle Formulations

Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu

Affiliations: Department of Biomedical Engineering, City University of Hong Kong

AI summary: This study presents a closed-loop workflow that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation to discover cryoprotectant formulations for cryomicroneedles; iterative wet-lab validation improved prediction accuracy and the resulting formulations.

Abstract

Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI-assisted, closed-loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation. A curated dataset of 198 mesenchymal stem-cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty-aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet-lab correction. Across ten validation iterations and 106 wet-lab observations, the model progressively adapted to cryomicroneedle-specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later-stage rank correlations became consistently positive, and the cumulative wet-lab predicted-versus-measured summary reached R^2 = 0.942. The best validated formulation achieved 95.15% post-thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi-objective optimization. These results demonstrate that agent-assisted computational infrastructure can make data-efficient formulation discovery more accessible to labs with minimal data expertise in-house. Project code is available at https://github.com/baitmeister/ML-for-CryoMN.
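
The two fit statistics quoted in this abstract (batch RMSE in percentage points and R^2 for predicted versus measured viability) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the viability numbers below are hypothetical and not taken from the paper, and R^2 is assumed here to be the usual coefficient of determination, which the abstract does not spell out.

```python
import math

def rmse(measured, predicted):
    """Root-mean-square error, in the same units as the inputs."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical post-thaw viabilities (percent) for one validation batch.
measured_viability = [50.0, 60.0, 70.0, 80.0]
predicted_viability = [52.0, 58.0, 73.0, 79.0]

batch_rmse = rmse(measured_viability, predicted_viability)   # percentage points
fit_r2 = r_squared(measured_viability, predicted_viability)
print(f"batch RMSE = {batch_rmse:.2f} pp, R^2 = {fit_r2:.3f}")
```

Because viability is itself a percentage, the RMSE comes out in percentage points, which is how the abstract reports the drop from 41.21 to 6.86.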

2605.19671 2026-05-20 cs.AI

Transforming Constraint Programs to Input for Local Search

Jo Devriendt, Patrick De Causmaecker, Marc Denecker

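Affiliations: University of Leuven

AI summary: This paper links symmetry properties of constraint optimization problems to local search neighborhoods, automatically derives neighborhoods from a constraint specification for use by metaheuristics in the IDP system, and evaluates the generated neighborhoods on six classical optimization problems.

Comments: Unpublished paper accepted and presented at the Fourteenth International Workshop on Constraint Modelling and Reformulation (ModRef) in 2015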

2605.19663 2026-05-20 cs.AI

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

Affiliations: Department of Control Science and Engineering, Tongji University

AI summary: This paper proposes P2DNav, a framework that addresses directional reasoning and local grounding in zero-shot vision-and-language navigation through panorama-to-downview decomposition, a sliding-window dialogue memory, and a reflective reorientation mechanism; experiments show strong performance on the R2R-CE benchmark.

Abstract

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

2605.19633 2026-05-20 cs.CL cs.AI cs.LG cs.NE cs.SE

optimize_anything: A Universal API for Optimizing any Text Parameter

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

Affiliations: MIT

AI summary: This paper presents a general-purpose LLM-based optimization system that optimizes text parameters across different domains, reports state-of-the-art results on six diverse tasks, and achieves efficient optimization through multi-task search and cross-problem transfer.

Comments: 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

Journal ref: Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA

DOI: 10.1145/3786335.3813167

Abstract

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.

2605.19631 2026-05-20 cs.RO cs.CV

A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees

Massimo Aria, Agostino Gnasso, Carmela Iorio

Affiliations: Department of Economics and Statistics, University of Naples Federico II

AI summary: This paper proposes a divergence-based framework of measures for evaluating the reconstruction quality of explainable ensemble trees; by distinguishing agreement from association, it offers a new diagnostic for pinpointing why a reconstruction fails.

Abstract

Validating interpretable surrogate models for ensemble learners requires measuring agreement between the ensemble's internal representation and its surrogate approximation, rather than mere association. Correlation-based approaches are scale-invariant and fail to detect systematic discrepancies in co-occurrence structure. We propose a statistical framework grounded in the agreement-association distinction, centered on the normalized Loss of Interpretability (nLoI). Rooted in the Cressie-Read power divergence family with lambda equal to 2, the nLoI admits a closed-form decomposition into within-node and between-node components, providing a unique diagnostic capability to identify precisely where and why reconstruction fails. The framework incorporates four complementary measures capturing distinct structural facets of approximation quality. A unified permutation testing procedure delivers valid inference for all measures within a single resampling pass. Theoretical properties, including boundedness and symmetry, are established for each metric. Monte Carlo simulations and empirical evaluations confirm exact Type I error control and demonstrate that these measures detect reconstruction fidelity gradients invisible to correlation-based alternatives. The framework is developed and illustrated in the context of Explainable Ensemble Trees (E2Tree), and empirical evaluation on three benchmark datasets illustrates the practical utility of the framework.
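
As a rough illustration of the divergence family named in this abstract, the Python sketch below evaluates the generic Cressie-Read power divergence between observed and expected counts. It is not the authors' nLoI: the abstract does not give the nLoI formula, its normalization, or its within-node and between-node decomposition, so the function names and the toy counts here are assumptions made for illustration only.

```python
def power_divergence(observed, expected, lam=2.0):
    """Cressie-Read power divergence:
    2 / (lam * (lam + 1)) * sum(O * ((O / E) ** lam - 1)).

    Assumes every expected count is positive. The cases lam = 0 and
    lam = -1 are defined as limits and are not handled in this sketch.
    """
    if lam in (0.0, -1.0):
        raise ValueError("lam = 0 and lam = -1 are limiting cases, not handled here")
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))
    return 2.0 * total / (lam * (lam + 1.0))

def pearson_chi_square(observed, expected):
    """Classical Pearson statistic, used only as a sanity check."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts with equal totals (60 observed, 60 expected).
observed = [10.0, 20.0, 30.0]
expected = [20.0, 20.0, 20.0]

cr_lambda_1 = power_divergence(observed, expected, lam=1.0)
cr_lambda_2 = power_divergence(observed, expected, lam=2.0)  # the lambda used in the abstract
chi_square = pearson_chi_square(observed, expected)
```

With lam = 1 the family reduces to Pearson's chi-square statistic whenever the observed and expected totals match, which gives a quick check that the formula is implemented consistently before moving to lam = 2.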

2605.19613 2026-05-20 cs.CV

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

Shuwei Li, Lei Tan, Robby T. Tan

Affiliations: National University of Singapore; ASUS Intelligent Cloud Services

AI summary: This paper proposes VLM-CC, a framework that achieves cross-camera color constancy through iterative, feedback-driven refinement guided by vision-language model evaluation, replacing direct RGB regression with perceptual feedback to improve robustness.

Comments: In CVPR 2026

Abstract

Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at https://github.com/NothingIknow/VLM-CC.
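
The feedback loop described in this abstract can be sketched as a small Python toy. Everything here is an assumption for illustration: the VLM judge is replaced by a numeric stand-in (`dominant_cast`) that simply names the strongest channel, the multiplicative step rule and the tolerance are invented, the pseudo-sRGB conversion is omitted, and the scene is a hypothetical gray patch. It shows only the control flow (white-balance, assess, nudge the estimate, repeat), not the authors' method.

```python
def white_balance(observed_rgb, illuminant):
    """Divide each channel by the current illuminant estimate."""
    return [c / e for c, e in zip(observed_rgb, illuminant)]

def dominant_cast(corrected_rgb, tolerance=0.03):
    """Stand-in for the VLM judge (assumption): return the index of the
    strongest channel (0=red, 1=green, 2=blue), or None if the patch
    already looks neutral within the tolerance."""
    if max(corrected_rgb) / min(corrected_rgb) < 1.0 + tolerance:
        return None
    return max(range(3), key=lambda i: corrected_rgb[i])

def refine_illuminant(observed_rgb, step=0.02, max_iters=200):
    """Iteratively raise the estimate for whichever channel still shows a cast."""
    estimate = [1.0, 1.0, 1.0]
    for _ in range(max_iters):
        cast = dominant_cast(white_balance(observed_rgb, estimate))
        if cast is None:  # judge reports no residual cast: converged
            break
        estimate[cast] *= 1.0 + step
    return estimate

# Hypothetical gray scene, so the observed mean color equals the illuminant.
true_illuminant = [0.8, 1.0, 0.6]
estimate = refine_illuminant(true_illuminant)
residual = white_balance(true_illuminant, estimate)
final_cast = dominant_cast(residual)
```

The estimate is recovered only up to an overall scale, which is harmless for white balancing because only the ratios between channels matter.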

2605.19607 2026-05-20 cs.CV cs.AI cs.LG

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

Affiliations: Korea Advanced Institute of Science and Technology; INEEJI Corp.

AI summary: This paper proposes Spectral Integrated Gradients (SIG), which builds the integration path from a singular value decomposition to reduce noise and improve attribution accuracy, outperforming existing path-based methods.

Comments: 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

Abstract

Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients (SIG), which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at https://github.com/leekwoon/sig/.

2605.19605 2026-05-20 cs.CV

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn

Affiliations: Chair of Sensor-based Geoinformatics, University of Freiburg; EcoVision Lab, DM3L, University of Zurich; Institute for Earth System Science and Remote Sensing, Leipzig University; Washington University, St. Louis

AI summary: This paper introduces two new open datasets for joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery, addressing the lack of globally harmonized datasets, and reports clear performance gains across multiple biomes.

Abstract

Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at deadtrees.earth/releases/dte-aerial-bench.
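
The "around 45% relative improvement" quoted for boreal forests follows directly from the two F1 scores in the abstract. A minimal Python check, where the helper name `relative_improvement` is an assumption chosen for illustration:

```python
def relative_improvement(before, after):
    """Relative change of `after` with respect to `before`."""
    return (after - before) / before

boreal_f1_before = 0.40  # F1 score reported in the abstract
boreal_f1_after = 0.58   # F1 score reported in the abstract

gain = relative_improvement(boreal_f1_before, boreal_f1_after)
gain_percent = round(gain * 100)
print(f"relative F1 improvement: about {gain_percent}%")
```

The absolute gain is 0.18 F1 points; dividing by the 0.40 starting point is what turns it into the relative figure.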

2605.19604 2026-05-20 cs.AI

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, Tong Yang

Affiliations: FairyClaw

AI summary: This paper proposes Formal Skill, a programmable runtime skill abstraction for LLM agents that uses JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state to improve agent efficiency and accuracy.

Abstract

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

2605.19600 2026-05-20 cs.RO

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

Jinhan Li, Xijie Huang, Zhaoqi Wang, Yijin Wang, Weiqi Ge, Qiyi He, Mo Zhu, Fei Gao, Yuze Wu, Xin Zhou

Affiliations: State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China; Differential Robotics, Hangzhou 311121, China

AI summary: This paper proposes FlyMirage, a fully automated pipeline that uses a generative world model to produce large-scale, diverse, and photorealistic aerial vision-language navigation data for UAVs, in support of next-generation embodied navigation models.

Abstract

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages a large language model (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

2605.19597 2026-05-20 cs.CL

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

Affiliations: Institute of Trustworthy Embodied Artificial Intelligence; Fudan University; Hunyuan Team, Tencent; School of Philosophy, Fudan University

AI summary: This paper presents LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark contains 246 Base items and 190 Hard items, and an evaluation of 14 frontier LLMs reveals substantial gaps in current models.

Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.

2605.19595 2026-05-20 cs.CV cs.AI

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

Affiliations: Department of Automation and Systems Engineering, Federal University of Santa Catarina, Florianópolis, Brazil; Applications Lab, Faculty of Science, University of Salamanca, Plaza de los Caídos s/n, 37008 Salamanca, Spain

AI summary: This paper proposes an optimized YOLO26-MoE model that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector to adapt to subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization is handled by an LLM agent, and the model reaches 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95 on UAV images, outperforming the latest YOLO versions.

Abstract

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

2605.19594 2026-05-20 cs.RO

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation

Jingyu Li, Zhe Liu, Wenxiao Wu, Li Zhang

Affiliations: Fudan University; Shanghai Innovation Institute; University of Hong Kong; Huazhong University of Science and Technology

AI summary: This paper proposes MCNav, a memory-aware navigation framework with a dynamic cognitive map that stores efficiently queryable information about relevant objects in explored areas. It targets missed or misidentified goals in zero-shot goal-oriented navigation through goal re-validation and missed-goal re-exploration, stabilized by blacklist and double-check mechanisms, and achieves state-of-the-art performance.

Abstract

Navigating to instance-level targets in complex environments is a challenging problem. Many existing zero-shot methods achieve strong performance by modeling the entire environment and leveraging large language models for scene understanding. However, such strategies primarily focus on exploring new regions while lacking a deeper exploitation of information from previously explored areas. Consequently, when targets are missed or misidentified within previously visited regions, navigation failures occur frequently. To address these limitations, we propose MCNav, a memory-aware navigation framework with a dynamic cognitive map. This map stores efficiently queryable information about relevant objects in explored areas. Building on this memory structure, MCNav introduces two memory-aware exploration strategies: goal re-validation, which re-assesses previously seen objects to correct matching failures, and missed goal re-exploration, which estimates the likelihood that a target is present in an explored region from contextual cues. These strategies are further stabilized by a blacklist mechanism to prevent repeated errors and a double-check mechanism for high-confidence confirmation. We evaluate MCNav on the HM3Dv1 and HM3Dv2 datasets across three different tasks, where it achieves state-of-the-art performance, particularly on the instance-level goal navigation task.

KIO-planner: Attention-Guided Single-Stage Motion Planning with Dual Mapping for UAV Navigation

Multi-Session Ground Texture SLAM in Low-Dynamic Environments

WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

D-CLING: Prior-Preserving Depth-Conditioned Fine-Tuning for Navigation Foundation Models

DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

Agentic Discovery of Cryomicroneedle Formulations

Transforming Constraint Programs to Input for Local Search

Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

K-Quantization and its Impact on Output Performance

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

optimize_anything: A Universal API for Optimizing any Text Parameter

HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

Optimal Reconstruction from Linear Queries

PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

Bézier Degradation Modeling for LiDAR-based Human Motion Capture

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees

White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

MCNav: Memory-Aware Dynamic Cognitive Map for Zero-shot Goal-oriented Navigation