arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09367 2026-06-09 cs.CV New submission

RT-SDGOD: Real-Time Single-Domain Generalized Object Detection

Yupeng Zhang, Fangzhuo Gao, Ruize Han, Wei Feng, Liang Wan

Affiliations: College of Intelligence and Computing, Tianjin University; Key Research Center for Surface Monitoring and Analysis of Relics, State Administration of Cultural Heritage; Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology

AI summary: To address the severe missed detections of real-time detectors under domain shift, the paper proposes RT-SDGDet, a multi-evidence collaborative modeling framework that improves generalization through one-to-many supervision, evidence diversity learning, and dual-view consistency learning, with no extra inference overhead.

Abstract

In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detectors. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate, at the level of problem formulation, the generalization capability of real-time detectors under such constrained inference budgets. To this end, we introduce Real-Time Single-Domain Generalized Object Detection (RT-SDGOD), which focuses on how real-time detectors can achieve cross-domain generalization under zero extra inference overhead by relying solely on training-time representation learning. We observe that, under domain shift, DETR-based real-time detectors mainly degrade through increased missed detections, rooted in limited and unstable object-level discriminative evidence. Based on this, we propose RT-SDGDet, a multi-evidence collaborative modeling framework for RT-SDGOD. The core idea is to enable multiple queries of the same object to collaboratively cover more sufficient discriminative evidence while maintaining the stability of such evidence modeling across views. Specifically, we use one-to-many (O2M) supervision to construct stable object-specific query groups, and further design Discriminative Evidence Diversity Learning (DEDL) and Dual-view Evidence Consistency Learning (DvECL) to expand object-level evidence coverage and improve evidence stability under appearance perturbations, respectively. Since all components are introduced only during training, our method incurs no extra inference overhead. Extensive experiments show that the proposed method achieves better generalization performance than existing approaches across multiple unseen target domains.

2606.09366 2026-06-09 cs.CL eess.AS New submission

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li, Yan Lu, Zhizheng Wu

Affiliations: The Chinese University of Hong Kong, Shenzhen; Microsoft Corporation; Microsoft Research Asia

AI summary: Proposes Convex Gate (C-Gate) to bridge speech and LLMs: a convex-hull constraint confines speech representations to the LLM's input embedding manifold, achieving strong joint performance on ASR and emotion recognition and showing that geometry, rather than discreteness, is the key design factor.

Abstract

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM's input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.
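
The convex-hull constraint can be illustrated with a small sketch. The abstract only states that each frame is a convex combination of token embeddings; the softmax parameterization of the weights, the function names, and the toy numbers below are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Map arbitrary scores to non-negative weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_gate_frame(frame_logits, token_embeddings):
    """Return a convex combination of token embeddings for one speech frame.

    frame_logits: one score per vocabulary token (hypothetical gate output).
    token_embeddings: list of embedding vectors, one per vocabulary token.
    """
    weights = softmax(frame_logits)
    dim = len(token_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]

# Toy vocabulary of three 2-D token embeddings (made-up numbers).
E = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frame = convex_gate_frame([2.0, 0.5, -1.0], E)
uniform = convex_gate_frame([0.0, 0.0, 0.0], E)
```

Because the weights are non-negative and sum to one, `frame` cannot leave the triangle spanned by the three toy embeddings, which is the sense in which the representation stays inside the embedding hull while remaining continuous.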

2606.09362 2026-06-09 cs.CV cs.LG New submission

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

Eduardo Borges, Manuel Abreu, Luís Garrote, Urbano J. Nunes

Affiliations: Autonomous Mobile Robot; University of Minho

AI summary: Uses vision-language models to generate semantic descriptions for zero-shot re-identification, achieving retrieval performance comparable to a supervised CNN baseline in autonomous-driving scenes while improving interpretability.

Comments: 7 pages

Abstract

Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting their interpretability and robustness in complex driving scenes. We propose a baseline study of a zero-shot pipeline using Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants and evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios, discussing and evaluating the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.

2606.09360 2026-06-09 cs.CV New submission

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

Yupeng Zhang, Yuzhong Feng, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan

Affiliations: College of Intelligence and Computing, Tianjin University; Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology; School of Artificial Intelligence, Nanchang University

AI summary: Proposes the ExDet framework, whose Text-Guided Extrapolation (TGE) and Detector-Compatible Rectification (DCR) modules strengthen cross-category and cross-domain generalization for open-domain open-vocabulary detection without retraining the detector, reaching state-of-the-art performance on several benchmarks.

Abstract

Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, leading to high training cost. We propose ExDet, a lightweight category-domain collaborative generalization framework for ODOVD that enhances the cross-category and cross-domain generalization of existing detectors. ExDet consists of Text-Guided Extrapolation (TGE), a lightweight Detector-Compatible Rectification (DCR) module, and ExRPN. Specifically, TGE exploits the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned from the TGE-generated prototypes in a detector training-free and real-data-free manner, and is inserted after the classification head at inference to rectify representations toward a detector-compatible source-domain visual distribution, thereby enhancing classification for targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, improving recall for novel and domain-shifted objects while providing better support for subsequent classification and DCR. ExDet achieves SOTA performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.

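2606.09355 2026-06-09 cs.RO New submission

MosaicIMU: Composing Carrier Experts for Generalizable Neural Inertial Odometry

Junye Zou, Huiyi Yan, Xinning Xu, Xiaolei Li, Pengkun Zhou, Jinhui Zhang, Ziyang Meng

Affiliations: Tsinghua University; Xi'an Jiaotong University; Beijing University of Chemical Technology; Beijing Information Science and Technology University; Beijing Institute of Technology

AI summary: Proposes the MosaicIMU framework, which composes carrier-specific expert features via prototype-based routing and combines them with a history-aware EKF for cross-carrier generalization; it freezes the pretrained model and learns a lightweight expert residual branch to adapt to new domains, and at edge deployment reuses the router to select online samples for efficient incremental updates, reducing average ATE and RTE-10s by 40% and 34%, respectively.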

Abstract

Robust inertial odometry is essential for various carriers when external sensing is unreliable. Learning-based methods reduce integration drift by capturing local motion priors, but these methods often remain tied to a particular carrier, limiting generalization across heterogeneous platforms. We present MosaicIMU, a carrier-conditioned Mixture-of-Experts (MoE) pretraining-and-adaptation framework for generalizable neural inertial odometry. MosaicIMU uses a prototype-based router to compose carrier-specific expert features, decodes local velocity and uncertainty constraints, and integrates them with a history-aware EKF. For unseen domain adaptation, it freezes the pretrained base model and learns a new lightweight expert residual branch. For edge deployment, it further reuses the router to select informative online samples for efficient incremental updates. Experiments show that MosaicIMU consistently outperforms learning-based baselines, reducing average ATE and RTE-10s by 40% and 34%, respectively. These results highlight that MosaicIMU provides a scalable pretraining-to-deployment paradigm for generalizable and adaptive neural inertial odometry.

2606.09353 2026-06-09 cs.CV cs.AI New submission

Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning

Maria De Marsico, Anil K. Jain, Annalaura Miglino

Affiliations: Sapienza University of Rome; Michigan State University; University of Salerno

AI summary: Studies multispecies animal face recognition via transfer learning (FaceNet and Vision Transformer), validated on dog, primate, and cattle datasets; dogs yield the best results (96.85% mean verification accuracy), and some settings surpass existing methods.

Comments: This paper extends the work published in the proceedings of the CAIP 2025 conference: 'Adapting to the Wild: From Human Face to Animal Face Recognition' by De Marsico, M., Jain, A. K., Miranda, M., & Orlando, A.

Abstract

Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.
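
For readers unfamiliar with the identification metric reported above, the following is a minimal sketch of how a Rank-1 Identification Rate is commonly computed from a probe-versus-gallery similarity matrix. The function name, the animal names, and the scores are hypothetical; the paper's exact evaluation protocol is not described in the abstract.

```python
def rank1_identification_rate(similarity, probe_ids, gallery_ids):
    """Fraction of probes whose best-scoring gallery entry shares their identity.

    similarity[i][j] is the match score between probe i and gallery entry j
    (higher means more similar).
    """
    hits = 0
    for scores, probe_id in zip(similarity, probe_ids):
        best = max(range(len(scores)), key=scores.__getitem__)
        if gallery_ids[best] == probe_id:
            hits += 1
    return hits / len(probe_ids)

# Made-up scores for three probes against a three-entry gallery.
sim = [
    [0.9, 0.2, 0.1],  # probe "rex": best match is gallery entry 0, a hit
    [0.3, 0.4, 0.8],  # probe "bella": best match is entry 2 ("milo"), a miss
    [0.1, 0.2, 0.7],  # probe "milo": best match is entry 2, a hit
]
ids = ["rex", "bella", "milo"]
rate = rank1_identification_rate(sim, ids, ids)
```

Verification accuracy, by contrast, scores pairwise same-or-different decisions, which is one reason the two numbers reported for dogs differ.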

2606.09351 2026-06-09 cs.CL stat.ME New submission

In-Context Learning for the Imputation of Public Opinion Data with Large Language Models

Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch

Affiliations: LMU Munich; Munich Center for Machine Learning; University of Mannheim; Institute for Employment Research (IAB); University of Maryland, College Park

AI summary: Proposes imputing missing survey data via in-context learning (ICL); evaluated on 150 opinion variables, it achieves lower absolute error than MICE PMM under all missingness mechanisms, with the largest advantage under missing-not-at-random (MNAR) conditions.

Abstract

Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
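
Coverage and interval width, the two quantities compared against MICE PMM above, can be sketched as follows. This is only an illustration of the definitions: the function name and all numbers are made up, and the abstract does not say how the paper aggregates these quantities across variables.

```python
def coverage_and_width(intervals, truths):
    """Share of intervals containing the true value, and mean interval width."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    mean_width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return hits / len(truths), mean_width

# Made-up 95% intervals for four estimated response shares.
intervals = [(0.40, 0.50), (0.20, 0.30), (0.60, 0.70), (0.10, 0.20)]
truths = [0.45, 0.31, 0.65, 0.15]
cov, width = coverage_and_width(intervals, truths)
```

A method is preferable when its coverage stays near the nominal level while its intervals are narrower, which is the trade-off the abstract reports.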

2606.09350 2026-06-09 cs.RO cs.CV New submission

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

Cornelius Schröder, Žygimantas Marcinkus, Markus Lienkamp

Affiliations: Technical University of Munich; Institute for Automotive Engineering, Munich Institute of Robotics and Machine Intelligence, School of Engineering and Design

AI summary: Proposes a deployment-friendly strategy that uses uncertainty estimation and statistical testing to reduce false dynamic predictions for static objects, substantially cutting false dynamic predictions and unnecessary stops in real-world driving.

Abstract

Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
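
The idea of separating motion from jitter with a two-sample z-test can be sketched in one dimension. The abstract does not give the test's exact form, so the setup below is an assumption: each measurement carries the detector's predicted variance, measurements are treated as independent, and the window contents, function names, and the 1.96 threshold are illustrative choices.

```python
import math

def two_sample_z(window_a, window_b):
    """z statistic for the difference of two window means.

    Each window is a list of (position, predicted_variance) pairs, where the
    variance stands in for the detector's aleatoric uncertainty.
    """
    mean_a = sum(p for p, _ in window_a) / len(window_a)
    mean_b = sum(p for p, _ in window_b) / len(window_b)
    # Variance of a mean of independent measurements: sum(var_i) / n^2.
    var_a = sum(v for _, v in window_a) / len(window_a) ** 2
    var_b = sum(v for _, v in window_b) / len(window_b) ** 2
    return (mean_b - mean_a) / math.sqrt(var_a + var_b)

def is_moving(window_a, window_b, z_crit=1.96):
    """Flag motion only if the shift is significant at roughly the 5% level."""
    return abs(two_sample_z(window_a, window_b)) > z_crit

# Jittering static object: small shifts relative to the predicted variance.
static_a = [(10.00, 0.04), (10.10, 0.04), (9.95, 0.04)]
static_b = [(10.05, 0.04), (9.90, 0.04), (10.10, 0.04)]
# Moving object: a clear shift with the same predicted uncertainty.
moving_b = [(11.00, 0.04), (11.20, 0.04), (11.40, 0.04)]
```

Unlike a fixed speed threshold, the decision scales with the reported uncertainty, so a noisy box that wobbles within its own error bars is not promoted to a dynamic object.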

2606.09348 2026-06-09 cs.LG cs.CL New submission

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

Affiliations: School of AI, Shanghai Jiao Tong University; XYZ AI Lab

AI summary: Proposes PBSD, which uses Bayes-calibrated self-distillation to turn sparse final rewards into fine-grained turn-level credit signals, addressing credit assignment in long-horizon agentic tasks; experiments show gains in both in-domain and out-of-domain settings and improved generalization.

Abstract

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
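
The Bayes step described above rests on the identity log p(answer | trajectory) - log p(answer) = log p(trajectory | answer) - log p(trajectory), and the right-hand side splits into a sum over turns. The sketch below shows only that decomposition; the function name and the log-probabilities are hypothetical, and how PBSD converts these scores into training weights is not specified in the abstract.

```python
def turn_level_evidence(teacher_logps, student_logps):
    """Per-turn log-likelihood ratios and their sum.

    teacher_logps[t]: log-probability of turn t under the answer-conditioned
    teacher; student_logps[t]: the same turn under the standard student.
    By Bayes' rule, the sum equals log p(answer | trajectory) - log p(answer).
    """
    scores = [lt - ls for lt, ls in zip(teacher_logps, student_logps)]
    return scores, sum(scores)

# Hypothetical log-probabilities for a three-turn trajectory.
teacher = [-2.0, -5.0, -1.0]
student = [-3.0, -4.0, -2.5]
scores, total = turn_level_evidence(teacher, student)
```

A positive score marks a turn that becomes more likely once the verified answer is known, so it reads as supporting the outcome; a negative score marks a turn that undermines it.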

2606.09343 2026-06-09 cs.AI New submission

Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers

Mickaël Basson, Philippe Preux

Affiliations: Université de Lille, France; CNRS, France; Inria, France; UMR 9189-CRIStAL, Lille, France

AI summary: Proposes Projected Consistency Inference (PCI), which replaces gradient refinement with structure-aware projections, reaching optimality gaps of 0.17% and 0.31% on TSP500 and TSP1000, respectively, while reducing inference time by 30 to 40%.

Journal ref: The 20th Learning and Intelligent OptimizatioN Conference (LION), Jun 2026, Milan (Italie), Italy

Abstract

Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative models such as diffusion and consistency models. State-of-the-art approaches like FT2T combine fast consistency-based prediction with gradient-based inference-time refinement. However, gradient search often incurs significant computational overhead and may not align with the discrete structure of feasible solutions. We introduce Projected Consistency Inference (PCI), a plug-and-play, retraining-free alternative that replaces gradient refinement with structure-aware projections: PCI decodes valid Hamiltonian tours from the consistency model output and applies a lightweight local search (e.g., 2-opt). PCI achieves an average optimality gap (OG) of 0.17% on TSP with 500 cities, and 0.31% on TSP with 1000 cities, outperforming FT2T's best settings (OG 0.22% and 0.36%, respectively) while reducing inference time by 30 to 40%. PCI also exhibits lower variance and memory usage, and can surpass classical heuristics such as LKH3 in rapid solution generation. Our results demonstrate that structure-aware inference-time operations provide a practical and principled path for neural TSP solvers, complementing training-time objectives.
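
The lightweight local search mentioned above, 2-opt, is a standard heuristic and can be sketched independently of the paper. The code below is a plain textbook version under the assumption that a valid tour has already been decoded; it does not reproduce PCI's decoding step, and the function names and the four-city instance are illustrative.

```python
import math

def tour_length(tour, coords):
    """Total Euclidean length of a closed tour over city indices."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def two_opt(tour, coords):
    """Repeatedly reverse segments while doing so shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city
                a, b = coords[tour[i]], coords[tour[i + 1]]
                c, d = coords[tour[j]], coords[tour[(j + 1) % n]]
                # Change in length if edges (a,b) and (c,d) become (a,c) and (b,d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour

# Unit square visited in a self-crossing order.
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossing = [0, 2, 1, 3]
fixed = two_opt(crossing, coords)
```

Every move keeps the result a valid Hamiltonian tour, which is the property that makes this kind of refinement structure-aware in a way that a gradient step on a continuous relaxation is not.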

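2606.09340 2026-06-09 cs.LG New submission

Thresholded Local Hyper-Flow Diffusion

Meher Chaitanya, Sebastian Dalleiger, Luana Ruiz

Affiliations: KTH Royal Institute of Technology; Johns Hopkins University

AI summary: Proposes the TL-HFD algorithm, which localizes diffusion for seeded hypergraph clustering through a local active region and thresholded boundary activation, proves the local update equivalent to the global update, and gives finite-time dual suboptimality bounds.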

Abstract

Local Hyper-Flow Diffusion (HFD) gives an edge-size-independent Cheeger-type guarantee for seeded clustering in general submodular hypergraphs, but existing HFD solvers do not keep intermediate computation local at every iteration. We introduce Thresholded Local HFD (TL-HFD), a first-order method that maintains an active region around the seeds, performs projected subgradient updates on that region and its immediate boundary, and expands via thresholded (top-k) boundary activation. We prove that the local update is exact: the degree-preconditioned projected subgradient step restricted to the active region and its boundary coincides with the unrestricted global update. We establish finite-time dual suboptimality for both exact and thresholded updates, treating the latter as inexact projected subgradient steps with explicit skipped-boundary error. We further derive an additive activated-volume bound controlled by realized local subgradient norms and the minimum boundary-push among newly activated vertices, and translate approximate dual optimality with localized support into a robust sweep-cut guarantee for early-stopped iterates. For general submodular cut-costs, each iteration is local in the scanned region and oracle-sensitive in the hyperedge primitive. Empirically, TL-HFD often matches or improves over HFD while activating less volume, with the largest gains on noisy instances where diffusion tends to absorb non-target vertices.

2606.09338 2026-06-09 cs.CL New submission

Multi-Hop Knowledge Composition is Bound by Pretraining Exposure

Yannis Karmim, Luis Marti, Djamé Seddah, Valentin Barrière

Affiliations: Inria, Paris, France; Inria, Chile; Dept. of Computer Science, Universidad de Chile

AI summary: Finds that large language models fail at implicit multi-hop reasoning even at 97% single-hop accuracy, because compositional contexts were missing from pretraining rather than because the knowledge is absent.

Abstract

Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We study this failure in a controlled natural language setting with a strict separation between individuals exposed to compositional contexts during pretraining and those that never appear in any such context. We confirm that compositional failure persists even at 97% 1-hop accuracy, establishing the gap as a pretraining failure rather than a knowledge absence. We propose and test nine data-centric augmentation formats and find that compositional pretraining transfers to unseen questions for exposed individuals, but never to individuals absent from compositional pretraining, suggesting that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning.

2606.09334 2026-06-09 cs.CL New submission

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

Kateryna Karpo, Artem Chernodub

Affiliations: Ukrainian Catholic University; YouScan; Zendesk

AI summary: Evaluates 11 commercial LLMs on minimal-edit Ukrainian grammatical error correction and finds that Gemini 3.1-Pro, combining minimal-edit prompting with a few-shot strategy, reaches F0.5=69.22, narrowing the gap to the fine-tuned SOTA.

Abstract

Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.
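
The F0.5 metric quoted above is the general F-beta score with beta = 0.5, which weights precision more heavily than recall and so rewards systems that make few but correct edits. A minimal sketch of the formula follows; the precision and recall pairs are made up and are not figures from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up precision/recall pairs, not results from the benchmark.
cautious = f_beta(0.80, 0.40)  # few edits, mostly correct
eager = f_beta(0.40, 0.80)     # many edits, many of them wrong
```

The two toy systems would tie under the balanced F1 score, but the cautious one scores higher under F0.5, which fits a minimal-edit benchmark where overcorrection is penalized.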

2606.09327 2026-06-09 cs.LG cs.AI New submission

A Universal Dense Football Event Representation Based on TabTransformer

Weiran Yang, Daniel Memmert, Maximilian Klemp-Weins

Affiliations: Institute of Exercise Training and Sport Informatics, German Sport University Cologne

AI summary: Proposes a TabTransformer-based model that learns embedding vectors for categorical features to produce dense football event representations, outperforming baselines on downstream tasks.

Comments: 12 pages, 1 figure. Preprint submitted to the 13th Workshop on Machine Learning and Data Mining for Sports Analytics (MLSA 2026)

Abstract

Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.
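
The Brier score used for the calibration comparison above is the mean squared difference between predicted probabilities and binary outcomes, with lower values being better. The sketch below shows the definition on made-up predictions; it says nothing about the paper's actual tasks or numbers.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Made-up predictions for four events (1 = the event happened).
outcomes = [1, 0, 0, 1]
calibrated = brier_score([0.8, 0.2, 0.1, 0.7], outcomes)
overconfident = brier_score([1.0, 0.0, 1.0, 0.0], outcomes)
```

Confident wrong predictions are penalized quadratically, which is why the score is a common check on whether predicted probabilities can be trusted as probabilities.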

2606.09323 2026-06-09 cs.AI cs.DB 新提交

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

TRL-Bench:标准化跨范式的表格编码器表示级评估

Wei Pang, Xiangru Jian, Hehan Li, Zhixuan Yu, Alex Xue, Jinyang Li, Zhengyuan Dong, Xinjian Zhao, Hao Xu, Chao Zhang, Reynold Cheng, M. Tamer Özsu, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) University of Waterloo(滑铁卢大学) The University of Hong Kong(香港大学) The University of Sydney(悉尼大学) Université Lyon 1(里昂第一大学)

AI总结 提出TRL-Bench,通过标准化下游条件,从列/表、行和组合数据湖表增强三个粒度评估表格编码器,揭示编码器质量具有能力特异性而非单一排名。

AI中文摘要

表格编码器通常在特定任务的全流程管道中进行评估,因此即使处理相似的表格信号,来自不同训练范式的模型也难以直接比较。我们引入了TRL-Bench,一个多粒度表格表示学习基准,用于标准化跨范式的表示级评估:每个编码器通过其支持的封装器导出行、列或表嵌入,共享的轻量级探测头在三个套件中对其进行探测:TRL-CTbench(列/表)、TRL-Rbench(行)和TRL-DLTE(涵盖所有三种粒度的组合数据湖表增强)。为支持这一标准化设置,我们发布了精选的基准资产和任务重构,包括50个OpenML表格(含123个验证目标)、16个行对链接重写以及一个由1379个父表衍生的47772表DLTE湖。在20个模型和16个任务上的实验表明,一旦下游条件标准化,编码器质量是能力特定的,而非由单一排行榜决定。在TRL-CTbench中,通用文本编码器通常在具有强表面文本信号的任务上领先,而表格专用编码器在其预训练目标与任务对齐时获胜。在TRL-Rbench中,表内预测和跨表链接偏好不同的训练机制,原子链接性能与DLTE管道中的行匹配阶段强相关。在TRL-DLTE中,最强管道结合了能力匹配的专用编码器而非重复使用单一编码器,且顶级端到端质量取决于非加性的组合适配而非每阶段边际排名。TRL-Bench提供了一个通用协议,用于在共享下游条件下测量导出表格表示中的可复用信号。代码和数据:https://github.com/LOGO-CUHKSZ/TRL-Bench

英文摘要

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench

2606.09314 2026-06-09 cs.RO 新提交

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

KPGrasp: 可扩展的关键点流匹配用于灵巧抓取生成

Yuansen Huang, Jiayi Chen, Haoran Liu, Yubin Ke, Bing Han, Jiangran Lyu, Mi Yan, Li Yi, He Wang

发表机构 * Peking University(北京大学) Galbot Xi’an Jiaotong University(西安交通大学) Tsinghua University(清华大学)

AI总结 提出KPGrasp框架,通过全欧几里得手部关键点参数化和Transformer流模型,从大规模数据学习灵巧抓取先验,无需接触损失或测试时优化,在模拟和真实场景中实现高成功率与低穿透深度。

Comments 14 pages, 7 figures, 6 tables

AI中文摘要

对于基于学习的方法而言,生成高质量的灵巧抓取仍然具有挑战性,这些方法通常依赖于精心调整的接触损失或昂贵的基于接触的测试时优化。我们提出了KPGrasp,一个流匹配框架,从大规模数据中学习灵巧抓取先验,而不是依赖接触损失或基于接触的测试时优化。KPGrasp将全欧几里得3D手部关键点参数化与一个简单但可扩展的Transformer流模型相结合。该参数化避免了传统混合SE(3)姿态和关节角度输出空间的缺点,在与物体点云相同的坐标系中表达抓取,从而实现了原生空间推理;Transformer流模型仅使用标准流匹配损失进行训练,并随着数据、模型容量和批大小有效扩展。实验表明,在两个模拟基准上达到了最先进的性能。在Dexonomy基准上,它达到了76.3%的抓取成功率,比最强的直接可比基线提高了47.4%,同时将穿透深度减少到2.4毫米。同一模型在DexGrasp Anything基准上也无需微调即可达到最佳平均性能。对于批量推理,KPGrasp每次抓取仅需0.032秒。最后,在20个不同物体上的真实世界实验表明,该流水线可以在真实环境中部署。

英文摘要

Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.

2606.09313 2026-06-09 cs.LG stat.AP 新提交

Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time

卫星温室气体反演的机器学习仿真:时间稳定性

Nugzar Gognadze, Motonobu Kanagawa, Yu Someya, Hisashi Yashiro

发表机构 * EURECOM National Institute for Environmental Studies(国立环境研究所)

AI总结 研究机器学习仿真卫星温室气体反演算法的时间稳定性,发现预测精度随时间下降,加入时间特征可改善Lasso和神经网络模型的XCH4预测,简单Lasso模型表现优于复杂方法且更稳定。

Comments 48 pages, 9 figures, 15 tables

AI中文摘要

反演算法通过求解高光谱分辨率卫星辐射测量值的逆问题,用于估算二氧化碳(CO2)和甲烷(CH4)等温室气体(GHGs)的大气浓度。然而,这些算法计算成本高,使得大规模实时估算变得困难。因此,机器学习模型被提出作为反演算法的快速仿真器。然而,现有大多数研究仅使用与训练数据同期的测试数据评估它们。我们利用温室气体观测卫星(GOSAT)的数据研究此类仿真器的时间稳定性。我们表明,当测试期远离训练期时,预测精度通常会下降。我们还表明,将时间作为输入特征显著改善了Lasso和神经网络模型的XCH4预测。在所考虑的方法中,简单的Lasso模型表现与神经网络等更复杂的方法相当或更好,并且随时间产生更稳定的预测。我们利用地面观测网络——总碳柱观测网络(TCCON)进一步验证了结果。在TCCON匹配数据集上,时间增强的Lasso模型对TCCON的误差与GOSAT和TCCON之间在XCO2和XCH4上的差异相当。

英文摘要

Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.

2606.09312 2026-06-09 cs.LG cs.PL 新提交

Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

迈向编译器世界模型:学习潜在动态以实现高效张量程序搜索

Haolin Pan, Lianghong Huang, Xvlin Zhou, Mingjie Xing, Yanjun Wu

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出一种受世界模型启发的评估器,通过轻量级过渡模型在连续潜在空间中展开调度动作,避免昂贵AST变异和重复编码,在TVM AutoScheduler中实现比Ansor更优的延迟和测量效率。

AI中文摘要

张量程序优化对现代机器学习系统至关重要,但其搜索空间巨大。现有的自动调度器通过学习成本模型来降低测量成本,但它们通常将每个候选视为静态代码快照,忽略了产生它的调度轨迹。这使得它们对动作依赖不敏感,且易受表面代码变化影响。我们提出一种受世界模型启发的评估器,将调度评估建模为程序状态上的动作条件潜在动态。从初始程序开始,它使用轻量级过渡模型在连续潜在空间中展开调度动作,避免了昂贵的AST变异和重复代码编码。最终的动态表示与动作和硬件特征结合以对候选进行排序。在TVM AutoScheduler中实现后,我们的方法在相同64次试验预算下,GPU上代表性子图延迟比Ansor提升1.37倍,CPU上提升1.54倍。它还在使用10倍更少测量次数的情况下,在2.2%几何平均内匹配Ansor-10K,并将完整模型推理速度提升至PyTorch/PyTorch-opt(cuDNN)的4.61倍/3.67倍几何平均。

英文摘要

Tensor program optimization is essential for modern machine learning systems, but its search space is enormous. Existing auto-schedulers reduce measurement cost with learned cost models, yet they usually evaluate each candidate as a static code snapshot, ignoring the schedule trajectory that produced it. This makes them insensitive to action dependencies and vulnerable to superficial code variations. We propose a world-model-inspired evaluator that models schedule evaluation as action-conditioned latent dynamics over program states. Starting from the initial program, it rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. The final dynamic representation is combined with action and hardware features to rank candidates. Implemented in TVM AutoScheduler, our method improves representative-subgraph latency over Ansor by 1.37× on GPU and 1.54× on CPU under the same 64-trial budget. It also matches Ansor-10K within 2.2% geometric mean using 10× fewer measurements, and accelerates full-model inference over PyTorch/PyTorch-opt (cuDNN) by 4.61×/3.67× geometric mean.

2606.09311 2026-06-09 cs.AI 新提交

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

FF-JEPA:基于潜在规划器的世界模型中的长时域规划

Sergi Masip, Jonathan Swinnen, Yutong Hu, Renaud Detry, Tinne Tuytelaars

发表机构 * KU Leuven(鲁汶大学)

AI总结 提出FF-JEPA层次化方法,通过引入无动作潜在规划器预测子目标,将复杂轨迹分解为短期优化问题,解决长时域规划中计算昂贵和需要目标图像的问题。

AI中文摘要

联合嵌入预测架构(JEPAs)展示了有前景的世界建模能力,能够通过使用交叉熵方法(CEM)等方法优化动作轨迹,在潜在空间中进行规划。然而,这些方法对于长时域规划而言计算成本过高且效果不佳。此外,这些方法通常需要目标状态的显式图像,这在现实任务中并不总是可行。在这项工作中,我们通过提出Forward-Forward-JEPA(FF-JEPA)来解决这些局限性,这是一种利用两个前向动力学模型的层次化方法。除了标准的动作条件前向模型外,我们还引入了一个无动作潜在规划器,该规划器根据当前状态预测下一个子目标。这种方法消除了对目标图像的需求,并通过将复杂轨迹分解为一系列可处理的短期优化问题来实现长时域规划。在PushT上的初步结果表明,FF-JEPA成功克服了扁平世界模型的长时域崩溃,凸显了该方法作为无目标规划的一个有前景的方向。

英文摘要

Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.
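For readers unfamiliar with the Cross-Entropy Method (CEM) named in the abstract above, the following is a minimal sketch of flat CEM planning over a short action sequence. It is not the FF-JEPA method: the one-dimensional additive dynamics, the goal value, and the cost weights are all hypothetical stand-ins for a learned world model.

```python
import random
import statistics


def cem_minimize(cost, dim, iters=30, pop=64, n_elite=8, seed=0):
    """Cross-Entropy Method: repeatedly refit a Gaussian to the lowest-cost samples."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Keep the n_elite candidates with the lowest cost.
        elites = sorted(samples, key=cost)[:n_elite]
        # Refit the per-dimension mean and standard deviation to the elites.
        columns = list(zip(*elites))
        mean = [statistics.fmean(c) for c in columns]
        std = [statistics.pstdev(c) + 1e-6 for c in columns]
    return mean


GOAL = 3.0  # hypothetical goal state


def rollout(actions, start=0.0):
    """Toy dynamics (an assumption): each action is added to the scalar state."""
    state = start
    for a in actions:
        state += a
    return state


def plan_cost(actions):
    """Squared distance to the goal plus a small action penalty."""
    return (rollout(actions) - GOAL) ** 2 + 0.01 * sum(a * a for a in actions)


best = cem_minimize(plan_cost, dim=3)
print("planned actions:", [round(a, 3) for a in best])
print("final state:", round(rollout(best), 3))
```

Every candidate must be rolled out over the full horizon at each iteration, which is one reason this kind of flat search becomes expensive as the horizon grows.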

2606.09304 2026-06-09 cs.CL cs.LG 新提交

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

SG-OPD: 通过符号一致性门控和分阶段教师采样的符号门控在线蒸馏

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan

发表机构 * Zhejiang University(浙江大学) Hunan University(湖南大学) Tianjin University(天津大学) Shanghai Jiao Tong University(上海交通大学) Jilin University(吉林大学)

AI总结 针对在线蒸馏中轨迹级对齐和教师偏好均匀可靠性假设的失效问题,提出SG-OPD方法,通过符号一致性门控和分阶段教师采样改进蒸馏效果,在竞赛级数学推理任务上平均提升1.98和7.50。

AI中文摘要

在线蒸馏(OPD)在自身轨迹上训练学生模型,并利用更强教师的密集逐token监督,通常优于离线蒸馏和标准强化学习。然而,我们发现其有效性隐含地依赖于两个在实践中经常失效的假设:学生与教师之间的轨迹级对齐,以及教师偏好的均匀token级可靠性。因此,我们提出符号门控在线蒸馏(SG-OPD),该方法使用二元验证器作为教师信任信号,在两个互补粒度上发挥作用:分阶段教师采样在冷启动时混合验证器认可的教师轨迹,而符号一致性门控在教师与验证器校正方向一致的token上外推蒸馏更新,在分歧时内插。在竞赛级数学推理基准上的实验表明,SG-OPD持续优于标准OPD,在每样本和每问题水平上平均提升分别为1.98和7.50。

英文摘要

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.

2606.09303 2026-06-09 cs.CV 新提交

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

再思考:通过候选发现与比较推理进行分割

Xinyan Gao, Haoran Hao, Xiangyu Yue

发表机构 * The Chinese University of Hong Kong(香港中文大学) Nanjing University(南京大学)

AI总结 提出两阶段框架Rea2Seg,先基于注意力图生成候选掩码,再用多模态大语言模型推理评分,将分割转化为候选发现与判别选择,并引入新基准ReasonSeg-SGDR全面评估感知、定位与推理能力。

Comments Project page: https://snowball521.github.io/Rea2Seg-Project/

AI中文摘要

预训练基础模型的快速发展使得更通用的图像分割成为可能。多模态大语言模型(MLLMs)已被广泛探索用于需要高级推理的复杂查询的图像分割。尽管取得了有希望的进展,现有方法通常受限于有限的训练数据以及MLLMs与掩码生成模块之间的差距。为了更好地将MLLMs的感知和推理能力迁移到复杂的基于推理的分割任务,我们提出了一个两阶段框架Rea2Seg用于掩码生成和选择。具体来说,该框架首先基于分割MLLM的注意力图识别潜在区域作为候选掩码。然后,它利用MLLM对问题和候选掩码进行推理,并为每个掩码分配分数。最终的分割结果通过对候选掩码重新排序并选择最高分的掩码获得,将图像分割重新表述为候选发现后跟判别性掩码选择。我们还注意到,现有基准中的大部分问题集中在常识推理上,这些问题通常不需要完全的联合视觉观察和推理。为了解决这个问题,我们引入了一个名为ReasonSeg-SGDR的新基准,该基准在多个维度上全面评估模型的感知、定位和推理能力,包括判别性识别、空间推理、几何推理和多步推理,并带有细粒度的掩码生成。此外,我们收集训练数据以增强MLLMs联合理解多模态查询和候选掩码的能力,并通过推理分配分数。在提出的基准和ReasonSeg上的实验结果表明了统一掩码生成和选择框架的有效性。

英文摘要

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

2606.09301 2026-06-09 cs.LG 新提交

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

PRISM: 面向模态缺失联邦图学习的拓扑感知跨模态插补

Zekai Chen, Miao Zhang, Jiayang Xing, Xunkai Li, Xun Wu, Rong-Hua Li, Guoren Wang

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 针对联邦图学习中客户端级模态缺失问题,提出拓扑感知跨模态插补框架PRISM,通过联邦检索缺失模态语义并利用拓扑控制注入局部图传播,在六个多模态图数据集上平均提升4.48%。

AI中文摘要

多模态联邦图学习(MM-FGL)旨在从包含文本和图像的分散图中协作学习。然而,现实世界的客户端可能没有共同的模态基础:视觉搜索客户端可能包含图像-交互图但没有卖家描述,而目录客户端可能提供文本但没有产品图像。我们将这种实际设置称为客户端级模态缺失。与随机的实例级缺失不同,缺失模态的客户端缺乏重建缺失模态所需的局部语义基础。更重要的是,在图学习中,不完整的表示初始化消息传递,因此插补误差可以被接收拓扑过滤、混合和放大。为了解决这一问题,我们提出了PRISM(Proactive Retrieval and Imputation via Structural Meta-prompting),一个拓扑感知的联邦跨模态插补框架。PRISM不是仅从局部观测重建缺失模态,而是从联邦中恢复缺失模态语义,并在拓扑感知控制下将其引入局部图传播。在六个多模态图数据集上的实验表明,PRISM持续改善模态缺失客户端,平均优于最先进的基线4.48%。

英文摘要

Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text and images. However, real-world clients may not share a common modality basis: a visual-search client may contain image-interaction graphs but no seller descriptions, while a catalog client may provide text but no product images. We refer to this practical setting as client-level modality deficiency. Unlike random instance-wise missingness, a deficient client lacks the local semantic basis needed to reconstruct the absent modality. More importantly, in graph learning, incomplete representations initialize message passing, so imputation errors can be filtered, mixed, and amplified by the receiving topology. To address this gap, we propose PRISM (Proactive Retrieval and Imputation via Structural Meta-prompting), a topology-aware federated cross-modal imputation framework. Rather than reconstructing the missing modality solely from local observations, PRISM recovers missing-modality semantics from the federation and introduces them into local graph propagation under topology-aware control. Experiments on six multimodal graph datasets across graph-centric and modality-centric tasks show that PRISM consistently improves modality-deficient clients, outperforming state-of-the-art baselines by 4.48% on average.

2606.09295 2026-06-09 cs.CL 新提交

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

NüshuVoice:利用音高感知文本到语音技术复兴濒危女书的声音

Hongkun Yang, Xinhui Yi, Xiyan Zhao, Yibo Meng, Lionel Z. Wang, Lixu Wang, Yaqi Zhang, Ruiqi Chen, Xuanyue Zhao, Lanxin Zhang, Yu Zeng, Weijia Chu, Yiming Ma, Chenyu Liu, Jianghao Lin, Xin Xu

发表机构 * Ocean University of China(中国海洋大学) The Hong Kong Polytechnic University(香港理工大学) Cornell University(康奈尔大学) Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学) University of Michigan–Ann Arbor(密歇根大学安娜堡分校) University of Science and Technology of China(中国科学技术大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 针对女书语音数据稀缺问题,提出NüshuVoice基准和F0条件VITS框架Nüshu-PitchVITS,利用五级音高标注作为韵律先验,在频谱保真度、音高重建和可懂度上优于强基线。

Comments 12 pages, 3 figures

AI中文摘要

女书是一种濒危的音节文字,历史上由中国湖南省南部江永县的女性使用。现有的女书计算研究主要关注文本数字化和视觉识别,其真实发音的声学重建仍基本未被探索。构建女书文本到语音(TTS)系统尤其具有挑战性,因为可用的录音极其有限,且大多为孤立的音节级发音而非自然的句子级话语。在这项工作中,我们介绍了NüshuVoice,这是首个女书TTS基准。我们构建了一个句子级女书文本到音频数据集,对齐了标准化的Unicode女书文本、音标、标准中文翻译和档案录音。为了在这种极端低资源设置下合成语音,我们提出了Nüshu-PitchVITS,一种F0条件VITS框架,利用女书的五级音高符号作为显式的韵律归纳偏置。实验结果表明,Nüshu-PitchVITS在频谱保真度、音高重建和人类评定的可懂度方面优于强TTS基线。我们公开发布了数据集和代码,网址为:https://anonymous.4open.science/r/Nvshu-TTS-2EB6。

英文摘要

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.

2606.09294 2026-06-09 cs.CV 新提交

Virtual-point-based Solutions to Handle Generalized Absolute Pose Problem

基于虚拟点的广义绝对位姿问题求解方法

Bin Li, Banglei Guan, Shunkun Liang, Yang Shang

发表机构 * National University of Defense Technology(国防科技大学) Hunan Institute of Advanced Technology(湖南高级技术研究所)

AI总结 针对多相机系统广义PnP问题,提出虚拟点公式化方法,将标准PnP求解器转化为广义位姿求解器,并基于Cayley、四元数和旋转矩阵参数化导出三种求解器,在精度、全局最优性和效率上优于现有方法。

AI中文摘要

多相机系统因其宽视场、灵活性和容错性在机器人和自主导航中日益普及。然而,现有的PnP求解器无法处理多个投影中心。本文引入一种虚拟点公式化方法,桥接了标准PnP与广义位姿问题,实现了将现有PnP求解器转化为广义位姿求解器的统一流程。基于该框架,我们推导了三种基于虚拟点的广义位姿求解器,即VGPc、VGPq和VGPr,分别利用Cayley、四元数和旋转矩阵参数化。大量实验表明,所提出的求解器继承了原始PnP算法的精度和效率,同时显著优于现有的广义求解器。具体而言,VGPc在异方差噪声条件下实现了更高的估计精度,VGPq保持了全局最优性,而VGPr在精度不降低的情况下提供了优越的计算效率。

英文摘要

Multi-camera systems are increasingly adopted in robotics and autonomous navigation for their wide field of view, flexibility, and fault tolerance. Nevertheless, existing PnP solvers fail to handle multiple projection centers. This paper introduces a virtual point formulation that bridges the standard PnP and generalized pose problems, enabling a unified pipeline that transforms existing PnP solvers into generalized pose solvers. Based on this framework, we derive three Virtual-point-based Generalized Pose solvers, namely VGPc, VGPq, and VGPr, leveraging Cayley, quaternion, and rotation-matrix parameterizations, respectively. Extensive experiments demonstrate that the proposed solvers inherit the accuracy and efficiency of original PnP algorithms while significantly outperforming existing generalized solvers. Specifically, VGPc achieves higher estimation accuracy under heteroscedastic noise conditions, VGPq maintains global optimality, whereas VGPr provides superior computational efficiency without accuracy degradation.

2606.09293 2026-06-09 cs.CL 新提交

One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

一个模型,多个目标:面向电商对话系统的自适应多目标学习

Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao, Tai Li, Qishen Zhang, Xiangliang Zhang, Xiuying Chen

发表机构 * ByteDance(字节跳动) MBZUAI(穆罕默德·本·扎耶德人工智能大学) University of Notre Dame(圣母大学)

AI总结 提出自适应多目标强化学习框架MORE,通过将推理功能作为约束指导策略优化,并引入自适应多奖励机制平衡语言目标,在电商对话系统中同时提升推理准确性和语言自然性,在线实验转化率提升30.09%。

Comments Accepted by KDD 2026

AI中文摘要

电商场景中的对话系统通常需要满足多个目标:准确推理用户画像(如资格、信用额度)以确保正确决策和用户状态理解,同时生成自然且忠实的回复。这些目标是互补但非完全一致的。在这项工作中,我们提出了MORE,一个自适应多目标强化学习框架,联合优化推理准确性和语言自然性。我们的初步实验表明,直接混合具有不同优化动态的奖励会导致振荡和不稳定的学习。因此,我们不优化单一的混合奖励,而是将推理函数视为指导策略优化的约束。在推理时,系统直接生成回复,无需显式推理步骤,同时仍受益于推理增强的支架,避免额外的推理开销。为了更好地平衡回复生成过程中的语言目标,我们引入了一种自适应多奖励机制,该机制聚合流畅性和自然性等信号,并通过梯度反馈动态重新加权。我们在字节跳动的两个真实对话系统和MultiWOZ 2.2基准上评估MORE,其持续优于强基线。在字节跳动生产流量的14天在线实验中,MORE将总体转化率和达成转化率分别提高了16.53%和30.09%,同时提高了用户满意度并降低了转接率。值得注意的是,在人机对比中,MORE恢复了人类客服所实现的增量转化提升的约60%。

英文摘要

Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from the reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweights them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, respectively, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.

2606.09292 2026-06-09 cs.RO cs.SY eess.SY 新提交

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

基于对偶四元数的无迹卡尔曼滤波与视觉惯性里程计在GPS拒止环境中的导航

Mohamed Khalifa, Hashim A. Hashim

发表机构 * Carleton University(卡尔顿大学)

AI总结 提出一种基于对偶四元数的无迹卡尔曼滤波(DQUKF)结合视觉惯性里程计(VIO),在GPS拒止环境下实现高精度状态估计,在EuRoC数据集上位置RMSE达0.2584米。

AI中文摘要

在GPS拒止环境中的可靠导航仍然是机器人、航空航天和自动驾驶车辆应用中的基本挑战。本文提出了一种基于对偶四元数的无迹卡尔曼滤波(DQUKF),配备视觉惯性里程计(VIO)算法,用于在GPS拒止位置实现精确状态估计以实现导航。所提出的框架以误差状态形式构建DQUKF,其中名义位姿由单位对偶四元数表示,局部位姿误差由6维扭量参数化表示,用于sigma点生成、协方差传播和测量校正。同时,VIO算法跨图像帧跟踪特征,同步IMU和相机之间的测量,并提供补充惯性传播的视觉约束。在EuRoC MAV数据集上的仿真结果表明,所提出的DQUKF在高初始化不确定性下收敛,并在困难飞行序列中实现了0.2584米的位置RMSE,优于基准滤波器。

英文摘要

Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation, enabling navigation in GPS-denied locations. The proposed framework formulates the DQUKF in error-state form, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584 m in the difficult flight sequence, outperforming the benchmark filters.

2606.09290 2026-06-09 cs.CV 新提交

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Visual Para-Thinker++:用于视觉推理的单策略多智能体框架

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan

发表机构 * Zhejiang University(浙江大学) Hunan University(湖南大学) Tianjin University(天津大学) University of Chinese Academy of Sciences(中国科学院大学) Shanghai Jiao Tong University(上海交通大学) Jilin University(吉林大学)

AI总结 提出Visual Para-Thinker++框架,通过共享MLLM策略实例化为多个角色智能体并行推理,结合多智能体能力注入和角色解耦优化,有效缓解视觉推理中的早期感知承诺和幻觉问题。

AI中文摘要

视觉推理需要整合分布在区域、属性和关系中的证据,这使得单链推理容易产生早期感知承诺和幻觉。我们提出Visual Para-Thinker++,一个单策略多智能体框架,其中共享的MLLM策略被实例化为角色条件的主智能体、工作智能体和总结智能体。主智能体使用固定分配模式分解任务;工作智能体在上下文隔离下并行推理;总结智能体整合所有工作智能体的推理轨迹,而不是对最终标签进行多数投票。共享策略通过多智能体能力注入和角色解耦多智能体优化进行训练,为相应的token片段分配角色特定的奖励和优势,以减少协作角色之间的梯度冲突。原生推理引擎通过共享视觉前缀和KV缓存重用实现高效的多智能体展开。在V*、CountBench、RefCOCO系列和HallusionBench上,Visual Para-Thinker++始终优于单轨迹和推理时并行基线,在幻觉敏感的视觉推理上尤其表现出色。

英文摘要

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.

2606.09286 2026-06-09 cs.RO 新提交

VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

VAIC: 基于解耦命令的视觉引导人形机器人敏捷物体交互控制

Dongting Li, Qianyang Wu, Xingyu Chen, Liang Li, Yuhang Lin, Sikai Wu, Guoyao Zhang, Mingliang Zhou, Diyun Xiang, Qiang Zhang, Renjing Xu, Jianzhu Ma

发表机构 * Tsinghua University(清华大学) HKUST(Guangzhou)(香港科技大学(广州)) Xiaomi Robotics Lab(小米机器人实验室)

AI总结 提出VAIC框架,通过解耦命令和两阶段蒸馏范式,仅依靠机载深度、历史本体感知实现人形机器人的敏捷物体交互,在箱体搬运、推车、滑板等动态任务中超越基线。

Comments Webpage: https://vaic-humanoid.github.io/

AI中文摘要

人形机器人在现实辅助中具有巨大潜力,但在非结构化环境中与物体的敏捷交互需要紧密耦合的全身协调。尽管近期取得了进展,当前控制器仍面临关键的部署差距:它们严重依赖密集的参考轨迹和完美的状态可观测性,这本质上限制了物理泛化。我们提出了视觉引导的敏捷交互控制(VAIC),这是一个统一框架,通过仅依靠机载深度、历史本体感知和解耦的用户命令接口来弥合这一差距。VAIC采用两阶段蒸馏范式。首先,一个特权教师策略利用精确的物体运动学和精确的环境状态掌握多样的交互技能。其次,一个可部署的学生策略通过将全身跟踪替换为多轴速度目标和每帧交互指示器来蒸馏这些能力。学生利用一个循环物体适应模块,从原始深度流和本体感知中隐式推断不可观测的物体动力学。在人形机器人上的评估和实际部署表明,单个VAIC策略能够成功执行高度多样的动态任务,包括箱体搬运、推车交互和滑板,持续优于基线,推动了自主人形机器人的部署。

英文摘要

Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.

2606.09278 2026-06-09 cs.LG cs.AI 新提交

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

Rafael Cabral, Pang Zixi, Ziyi Shou, Shen Xin

Affiliations * Huawei Celia Team

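AI summary To address hallucination by large language models in precision-critical domains such as technical diagramming and mechanical design, the paper proposes PyGeoX, a programmable geometric DSL, and the stratified benchmark PyGeoX-Bench. It also designs Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms to resolve Outlier Gradient Masking, making an 8B model competitive with larger frontier systems on the benchmark.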

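Details
AI abstract (translated from Chinese)

Large language models often hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: turning free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this problem tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems, each with verifiable per-constraint rewards. Using PyGeoX as a verifier, we identify a failure mode called Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example $\exp(-\mathrm{MSE})$), a single outlier constraint can cancel the learning signal of all other constraints. To solve this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Compared with MSE-based rewards (the natural baseline for geometry solvers), SAR raises the hard-tier solving rate by a factor of 2.3, and the resulting 8B model is competitive with larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.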

English abstract

Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.
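
The masking effect described in this abstract can be illustrated numerically. The sketch below is a minimal illustration, not the paper's implementation: the abstract does not give the SAR formula, so the bounded per-constraint term 1 / (1 + r_i^2), the function names, and the sample residuals are all assumptions chosen for illustration. It compares the gradient that a well-behaved constraint receives under $\exp(-\mathrm{MSE})$ with the gradient it receives under an additive, saturating reward when one other constraint is badly violated.

```python
import math

def global_norm_grad(residuals):
    """Gradient of R = exp(-MSE) with respect to each residual r_i."""
    n = len(residuals)
    mse = sum(r * r for r in residuals) / n
    # dR/dr_i = -(2 r_i / n) * exp(-MSE): every term shares the factor exp(-MSE)
    return [-(2.0 * r / n) * math.exp(-mse) for r in residuals]

def saturating_additive_grad(residuals):
    """Gradient of R = mean(1 / (1 + r_i^2)), an assumed bounded per-constraint form."""
    n = len(residuals)
    # dR/dr_i = -2 r_i / (n (1 + r_i^2)^2): depends only on r_i itself
    return [-2.0 * r / (n * (1.0 + r * r) ** 2) for r in residuals]

# Hypothetical residuals: four mildly violated constraints, then one severe outlier
clean = [0.5, 0.5, 0.5, 0.5]
with_outlier = [0.5, 0.5, 0.5, 30.0]

for name, res in [("clean", clean), ("outlier", with_outlier)]:
    # Print the gradient on the first (well-behaved) constraint under each reward
    print(name, global_norm_grad(res)[0], saturating_additive_grad(res)[0])
```

Under these assumptions, the gradient on the first constraint collapses toward zero for the global-norm reward once the outlier is present, while the additive form gives it the same gradient in both cases.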

2606.09276 2026-06-09 cs.LG New submission

ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Paul Kahlmeyer, Henrik Voigt, Michael Habeck, Joachim Giesen

Affiliations * University of Jena

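AI summary Proposes the ERBench benchmark, which evaluates symbolic regression algorithms through equation recovery tasks and emphasizes robustness under changing dimensionality, sampling size, sampling distribution, and sampling domain, filling a gap in existing benchmarks.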

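Details
AI abstract (translated from Chinese)

Equation discovery aims to automatically discover scientific models, in the form of mathematical equations, from data. Technically, equation discovery is carried out by symbolic regression algorithms. The performance of symbolic regression for equation discovery is measured along two dimensions: prediction accuracy on test data and recovery of known ground-truth formulas. For standard regression, accuracy is usually measured on in-domain test data, for example by randomly splitting a data set into training and test data. While this makes sense for in-domain interpolation, the common goal of ordinary regression, it can be misleading for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. We therefore focus on equation recovery to evaluate symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that do well at recovering known ground-truth formulas are good candidates for doing well at discovering unknown equations. Existing symbolic regression benchmarks include equation recovery tasks, but with only a small number of publicly known ground-truth formulas. Moreover, these benchmarks place little emphasis on evaluating the robustness of algorithms under changing dimensionality, sampling size, sampling distribution, and sampling domain. Yet this is crucial for practitioners who want to discover equations that model natural phenomena, because data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms that explicitly target the task of equation discovery.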

English abstract

Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: prediction accuracy on test data and recovery of known ground-truth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known ground-truth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, but only with a small number of publicly known ground-truth formulas. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms under changing dimensionality, sampling size, sampling distribution, and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.
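
Why in-domain accuracy can mislead is easy to see with a toy case. In the sketch below, which is a hypothetical illustration and not part of ERBench, the ground-truth formula is assumed to be sin(x) and the candidate model is its degree-5 Taylor polynomial; the sampling intervals and helper names are likewise assumptions. The candidate is nearly indistinguishable from the truth on the in-domain interval but is not the same equation, which only the out-of-domain error reveals.

```python
import math

def ground_truth(x):
    # Assumed ground-truth formula for this toy example
    return math.sin(x)

def candidate(x):
    # Degree-5 Taylor polynomial of sin(x) around 0: accurate near 0, wrong far away
    return x - x**3 / 6.0 + x**5 / 120.0

def grid(lo, hi, n=101):
    """Return n evenly spaced points on [lo, hi]."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def mse(model, target, xs):
    """Mean squared error of model against target on the points xs."""
    return sum((model(x) - target(x)) ** 2 for x in xs) / len(xs)

# Hypothetical sampling domains: training-like interval vs. a shifted interval
in_domain_mse = mse(candidate, ground_truth, grid(-1.0, 1.0))
out_of_domain_mse = mse(candidate, ground_truth, grid(4.0, 6.0))

print("in-domain MSE:    ", in_domain_mse)
print("out-of-domain MSE:", out_of_domain_mse)
```

A recovery-based check sidesteps this by asking whether the returned expression matches the known formula, rather than how well it predicts held-out samples.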