arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.13140 2026-05-14 cs.CV

Multi-Modal Guided Multi-Source Domain Adaptation for Object Detection

Sangin Lee, Seokjun Kwon, Jeongmin Shin, Namil Kim, Yukyung Choi

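Affiliations: Sejong University; NAVER LABS; Artificial Intelligence and Robotics Institute (AIRI)

AI summary: This paper studies object detection under multi-source domain adaptation, aiming to improve detection performance in the target domain, particularly when the training data distribution differs from the target domain. To address the inability of existing methods to effectively preserve domain-specific information while learning domain-agnostic features, the authors propose MS-DePro, which combines depth maps and text prompts to guide object localization and classification feature alignment, respectively. The method achieves state-of-the-art performance on multiple benchmarks, validating its effectiveness.

Abstract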

General object detection (OD) struggles to detect objects in the target domain that differ from the training distribution. To address this, recent studies demonstrate that training from multiple source domains and explicitly processing them separately for multi-source domain adaptation (MSDA) outperforms blending them for unsupervised domain adaptation (UDA). However, existing MSDA methods learn domain-agnostic features from domain-specific RGB images while preserving domain-specific information from the domain-agnostic feature map. To address this, we propose MS-DePro: Multi-Source Detector with Depth and Prompt, composed of (1) depth-guided localization and (2) multi-modal guided prompt learning. We leverage domain-agnostic input modalities, namely depth maps and text, to encode domain-agnostic characteristics. Specifically, we utilize depth maps to generate domain-agnostic region proposals for localization and integrate multi-modal features to align learnable text embeddings for classification. MS-DePro achieves state-of-the-art performance on MSDA benchmarks, and comprehensive ablations demonstrate the effectiveness of our contributions. Our code is available on https://github.com/sejong-rcv/Multi-Modal-Guided-Multi-Source-Domain-Adaptation-for-Object-Detection.

2605.13133 2026-05-14 cs.LG eess.SP

KAST-BAR: Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Modeling for Universal Neural Interpretation

Haoning Wang, Wenchao Yang, Shuai Shen, Yang Li

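Affiliations: School of Automation Science and Electrical Engineering, Beihang University, Beijing, China; School of Biological Science and Medical Engineering, Beihang University, Beijing, China; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; 7T Magnetic Resonance Imaging Translational Medical Center, Department of Radiology, Southwest Hospital, Army Medical University (Third Military Medical University), Chongqing, China

AI summary: This paper proposes KAST-BAR, a knowledge-anchored, semantically dynamic topology brain autoregressive model, to address two problems that electroencephalography (EEG) foundation models face in cross-task universal neural decoding: inadequate modeling of spatiotemporal topology and the modality gap between physiological signals and high-level semantics. The model uses a dual-stream hierarchical attention encoder to capture the brain's non-Euclidean topology and a knowledge-anchored semantic profiling module to align physiological signals with an expert-level semantic space, enabling more accurate neural signal decoding. Experiments show that KAST-BAR performs well on multiple downstream tasks and effectively integrates medical expert knowledge to improve the understanding and interpretation of EEG signals.

Abstract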

While EEG foundation models have shown significant potential in universal neural decoding across tasks, their advancement remains constrained by the inadequate modeling of complex spatiotemporal topology, as well as the inherent modality gap between low-level physiological signals and high-level textual semantics. To address these challenges, we propose a Knowledge-Anchored Semantically-Dynamic Topology Brain Autoregressive Model (KAST-BAR), which dynamically aligns physiological representations derived from multi-level brain topology with an expert-level semantic space. Specifically, we design a Dual-Stream Hierarchical Attention (DSHA) encoder that accurately captures the brain's intrinsic non-Euclidean topology by modeling local temporal dynamics with global spatial contexts. On this basis, a Knowledge-Anchored Semantic Profiler (KASP) is proposed to synthesize physically-grounded and instance-level textual profiles, which subsequently drive a Semantic Text-Aware Refiner (STAR) to dynamically reconstruct EEG representations using Latent Expert Queries. By conducting large-scale pre-training on 21 diverse datasets to build a foundation model, KAST-BAR effectively integrates expert-level medical knowledge into EEG signal representations, consistently achieving superior performance across six downstream tasks. Our code is available at https://github.com/KAST-BAR/KAST-BAR

2605.13131 2026-05-14 cs.LG cs.RO

ERPPO: Entropy Regularization-based Proximal Policy Optimization

Changha Lee, Gyusang Cho

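Affiliations: Korea Advanced Institute of Science and Technology

AI summary: This paper proposes Entropy Regularization-based Proximal Policy Optimization (ERPPO) to address the difficulty of policy optimization caused by non-stationary observations in multi-agent reinforcement learning. The method introduces a Distributional Spatiotemporal Ambiguity learner that estimates object detection uncertainty in multi-dimensional observation environments, and combines it with a dynamic entropy regularization term that strengthens exploration under high ambiguity and stabilizes policy updates under low ambiguity, improving target localization accuracy and search efficiency. Experiments show that ERPPO outperforms MAPPO on time-critical tasks such as maritime search and, in particular, effectively suppresses false detections under visually uncertain conditions.

Comments 9 pages, 5 figures

Abstract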

Multi-Agent Proximal Policy Optimization (MAPPO) is a variant of the Proximal Policy Optimization (PPO) algorithm tailored for multi-agent reinforcement learning (MARL). MAPPO optimizes cooperative multi-agent settings by employing a centralized critic with decentralized actors. However, in multi-dimensional environments, MAPPO cannot extract an optimal policy because agent observations are non-stationary. To overcome this problem, we introduce a novel approach, Entropy Regularization-based Proximal Policy Optimization (ERPPO). For policy optimization, we first define object detection ambiguity under a multi-dimensional observation environment. A Distributional Spatiotemporal Ambiguity (DSA) learner is trained to estimate object detection uncertainty under non-stationary constraints. Then, we enhance PPO with a novel entropy regularization term. This regularization dynamically adjusts the policy update by applying a stronger (L1) regularization to high-ambiguity observations to encourage significant exploratory actions and a weaker (L2) regularization to low-ambiguity observations to stabilize the proximal policy optimization. This approach is designed to enhance the probability of successful object localization in time-critical operations by reducing detection failures and optimizing the search policy. Experiments on a testbed with AirSim-based maritime search scenarios show that the proposed ERPPO improves accuracy. The proposed method also attains a higher gradient than MAPPO. Qualitative results confirm ERPPO's effectiveness in suppressing false detections in visually uncertain conditions.
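
The idea of tying exploration strength to observation ambiguity can be illustrated with a minimal sketch. This is not the authors' ERPPO objective: the abstract's L1/L2 regularizers are replaced here by a simpler, assumed scheme in which the standard PPO clipped surrogate receives an entropy bonus whose coefficient grows linearly with an ambiguity score in [0, 1]. The function name and the coefficients `beta_low` and `beta_high` are hypothetical.

```python
import numpy as np

def ppo_entropy_objective(ratio, advantage, entropy, ambiguity,
                          clip=0.2, beta_low=0.001, beta_high=0.05):
    """PPO clipped surrogate plus an ambiguity-scaled entropy bonus (illustrative only)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    entropy = np.asarray(entropy, dtype=float)
    ambiguity = np.clip(np.asarray(ambiguity, dtype=float), 0.0, 1.0)

    # Standard PPO clipped surrogate: the pessimistic minimum of the two terms.
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    surrogate = np.minimum(ratio * advantage, clipped * advantage)

    # Assumed rule: a larger entropy weight where the observation is more ambiguous.
    beta = beta_low + (beta_high - beta_low) * ambiguity
    return float(np.mean(surrogate + beta * entropy))
```

With this rule, a high-ambiguity sample rewards a high-entropy (more exploratory) policy more strongly than a low-ambiguity sample does.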

2605.13130 2026-05-14 cs.AI

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

Junjie Li, Ziao Wang, NingXuan Ma, Jianghong Ma, Xiaofeng Zhang

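Affiliations: Harbin Institute of Technology, Shenzhen, China; Hong Kong Baptist University, China; City University of Hong Kong, China

AI summary: This paper proposes GRACE, a gradient-aligned reasoning data curation method for efficient model post-training. The method scores each reasoning step by how well it aligns with the answer gradient direction and how consistent it is with the preceding reasoning path, then aggregates these scores into a sample-level selection criterion, without external reward models or step annotations. Experiments show that GRACE matches or even exceeds full-data performance while using less data, and that it transfers well across models.

Abstract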

Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.
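
The two step-level signals and the sample-level aggregation can be sketched as follows. This is a minimal illustration under stated assumptions, not GRACE itself: each step is assumed to be summarized by a proxy vector, alignment and consistency are both measured by cosine similarity, the trajectory is the mean of the preceding step vectors, and step scores are averaged. The names `score_trace`, `select_subset`, and `w_align` are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity; returns 0 when either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def score_trace(step_vecs, answer_vec, w_align=0.5):
    """Average per-step score for one reasoning trace."""
    step_vecs = np.asarray(step_vecs, dtype=float)
    scores = []
    for t, v in enumerate(step_vecs):
        align = cosine(v, answer_vec)  # agreement with the answer direction
        # Consistency with the mean of the preceding steps; the first step has none.
        consist = cosine(v, step_vecs[:t].mean(axis=0)) if t > 0 else 0.0
        scores.append(w_align * align + (1.0 - w_align) * consist)
    return float(np.mean(scores))

def select_subset(trace_scores, fraction):
    """Indices of the top `fraction` of traces by score, in ascending index order."""
    k = max(1, int(round(fraction * len(trace_scores))))
    order = np.argsort(trace_scores)[::-1]
    return sorted(order[:k].tolist())
```

A call such as `select_subset(scores, 0.2)` would correspond to the 20% data budget mentioned in the abstract.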

2605.13125 2026-05-14 cs.RO

MoCCA: A Movable Circle Probability of Collision Approximation

Tobias Kern, Christian Birkner

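Affiliations: CARISSMA Institute of Safety in Future Mobility, Technische Hochschule Ingolstadt

AI summary: In automated driving, accurately assessing the probability of collision (POC) is essential for collision avoidance and safe driving. This paper proposes MoCCA, a shape approximation algorithm that approximates each vehicle's geometry with a single optimized circle, reducing over-conservatism while preserving computational efficiency. The method establishes an upper bound on the approximation error and introduces a safety distance margin, adjustable based on orientation variance, to counter POC underestimation under partial coverage.

Comments Accepted at ITSC 2026

Abstract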

In automated driving, crash mitigation is crucial to ensure passenger safety. Accurate avoidance requires precise knowledge of the object's position and orientation. However, sensor noise and occlusions often result in tracking and prediction uncertainties. To account for these uncertainties, estimating the Probability of Collision (POC) is a critical requirement. While Monte Carlo sampling is a common estimation technique, its high computational demand and stochastic nature often render it unsuitable for real-time applications. Analytical POC calculations are simplified by approximating vehicle geometries using circular bounds. While multi-circle approximations offer higher fidelity than a single circumscribed circle, they significantly increase computational complexity. This paper proposes a shape approximation algorithm, MoCCA, which utilizes a single circle for each vehicle, optimized to minimize the relative distance between them. MoCCA maintains a computational efficiency comparable to standard single-circle techniques while reducing over-conservatism. To address the potential underestimation of POC inherent in partial coverage, we establish an upper bound for the approximation error, demonstrating that it depends primarily on inter-vehicle distance and orientation variance. Furthermore, we introduce a safety distance margin that can be calibrated solely based on orientation variance.
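
For orientation, the baseline that the abstract contrasts against can be sketched in a few lines: a Monte Carlo POC estimate for two vehicles, each bounded by a single circumscribed circle. This does not implement MoCCA's optimized circle placement; it assumes a Gaussian uncertainty on the relative position of the two circle centres, and the function names are hypothetical.

```python
import numpy as np

def circumscribed_radius(length, width):
    """Radius of the circle through the corners of a length x width rectangle."""
    return 0.5 * float(np.hypot(length, width))

def poc_monte_carlo(mean_rel, cov_rel, r_ego, r_obj, n=100_000, seed=0):
    """Fraction of sampled relative positions at which the two circles overlap."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean_rel, cov_rel, size=n)
    dist = np.linalg.norm(samples, axis=1)
    return float(np.mean(dist < r_ego + r_obj))
```

Because the circumscribed circle covers area outside the vehicle, this estimate tends to be conservative, which is the over-conservatism the abstract refers to.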

2605.13123 2026-05-14 cs.RO

Multi-Depth Uniform Coverage Path Planning for Unmanned Surface Vehicle Surveying

Maider Larrazabal, Tong Yang, Izaro Goienetxea, Jaime Valls Miro

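Affiliations: AZTI Foundation; Tsinghua University; University of the Basque Country (UPV/EHU); University of Technology Sydney; IKERBASQUE

AI summary: This paper proposes a new automatic coverage path planning algorithm for bathymetric surveying with unmanned surface vehicles. Conventional methods rely on fixed-depth back-and-forth paths that cannot adapt to seafloor terrain variation, resulting in uneven coverage; the proposed method uses coarse prior depth information to dynamically adjust path generation and sensor coverage range, achieving uniform coverage of the seafloor. Experiments show that the method significantly outperforms conventional approaches in both synthetic and real-world scenarios, with coverage ratios above 99% and 92%, respectively, and has substantial practical value.

Comments Accepted by ICRA 2026

Abstract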

This paper introduces a novel automatic coverage path planning algorithm for bathymetry surveying with unmanned surface vehicles. The detection range of the mapping sensor employed, a multibeam echo sounder, is heavily influenced by local seafloor depths. Hence, a path designed to uniformly cover the sea surface does not guarantee uniform coverage of the seafloor. Yet this is currently the typical process for bathymetric surveys, with the simplistic boustrophedon scheme along manually selected waypoints at constant depths being the most widespread planner used. The proposed scheme incorporates coarse prior depth information to pre-process the target region and adaptively guide path generation and sensing range configuration. By explicitly accounting for depth variations, the proposed algorithm designs a coverage path with optimised spacing between survey passes that adjusts the sensing beam aperture to achieve more consistent seafloor coverage. The proposed method is shown to offer significant improvements in both synthetic and real-world scenarios. Validation in challenging synthetic terrains achieves coverage ratios beyond 99%, a marked improvement over traditional boustrophedon paths, which reach at most 75% coverage. The same trend appears in realistic simulations using real bathymetric data from a coastal harbour, with coverage reaching over 92% and significantly surpassing boustrophedon sweeps, whose coverage rates stay below 65%. Beyond improved performance, the scheme also brings a fully automated design, suitable for autonomous marine vehicles, thus offering practical utility for real-world applications.
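
Why sensing range depends on depth can be shown with a short worked example. It assumes a flat seafloor directly below the vessel and a symmetric beam fan, so the swath width is 2 · depth · tan(aperture / 2); the `overlap` fraction between adjacent passes is a hypothetical parameter, and this is not the paper's planner.

```python
import math

def swath_width(depth_m, aperture_deg):
    """Across-track seafloor width covered by a symmetric fan over flat ground."""
    return 2.0 * depth_m * math.tan(math.radians(aperture_deg) / 2.0)

def line_spacing(depth_m, aperture_deg, overlap=0.1):
    """Spacing between adjacent survey passes that keeps the given swath overlap."""
    return (1.0 - overlap) * swath_width(depth_m, aperture_deg)
```

Under these assumptions a 90° aperture covers about 20 m at 10 m depth and about 40 m at 20 m depth, so passes spaced for the deeper area leave gaps over the shallower one.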

2605.13122 2026-05-14 cs.CV

Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

Jingxuan He, Xiyu Wang, Yunke Wang, Mengyu Zheng, Chang Xu

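Affiliations: The University of Sydney

AI summary: This paper studies the semantic grounding ability of instruction-based image editing models on zero-shot referring image segmentation. The analysis finds that these models already produce internal representations with strong foreground-background separability in the early stage of denoising, implicitly achieving language-conditioned semantic grounding. Building on this, the authors propose a training-free framework that uses the intermediate representations of pretrained image editing models and decomposes segmentation into spatial attention and semantic discrimination, yielding high-accuracy segmentation masks without full image generation and outperforming existing zero-shot methods on multiple datasets.

Abstract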

Instruction-based image editing (IIE) models have recently demonstrated strong capability in modifying specific image regions according to natural language instructions, which implicitly requires identifying where an edit should be applied. This indicates that such models inherently perform language-conditioned visual semantic grounding. In this work, we investigate whether this implicit grounding can be leveraged for zero-shot referring image segmentation (RIS), a task that requires pixel-level localization of objects described by natural language expressions. Through systematic analysis, we reveal that strong foreground-background separability emerges in the internal representations of these models at the earliest denoising timestep, well before any visible image transformation occurs. Building on this insight, we propose a training-free framework that repurposes pretrained image editing models for RIS by exploiting their intermediate representations. Our approach decomposes localization into two complementary components: attention-based spatial priors that estimate where to focus, and feature-based semantic discrimination that determines what to segment. By leveraging feature-space separability, the framework produces accurate segmentation masks using only a single denoising step, without requiring full image synthesis. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate that our method achieves superior performance over existing zero-shot baselines.

2605.13119 2026-05-14 cs.RO cs.AI cs.CV

Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

Zixing Lei, Changxing Liu, Yichen Xiong, Minhao Xiong, Yuanzhuo Ding, Zhipeng Zhang, Weixin Li, Siheng Chen

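Affiliations: Shanghai Jiao Tong University; Zhongguancun Academy; Beihang University

AI summary: This work addresses the limited ability of vision-language-action (VLA) models to execute long-horizon tasks, proposing a new strategy that combines a high-level vision-language model with specialized tool-type VLA modules. By introducing Tool-Aligned Post-Training (TAPT) and a tool-family interface, it achieves efficient coordination between long-horizon task planning and execution, significantly improving robots' task completion rate and instruction-following accuracy in complex environments.

Abstract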

Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this burden across a high-level vision language model (VLM) agent for temporal reasoning and a family of specialized VLA tools for diverse local physical operations. The VLM handles scene analysis, global planning, and recovery, while each VLA tool executes a bounded subtask. To tightly couple agent planning with VLA tool execution in long-horizon tasks, we introduce a VLA tool-family interface that exposes explicit tool selection and in-execution progress feedback, enabling efficient event-triggered agent replanning without continuous agent polling. To obtain diverse specialized VLA tools that faithfully follow agent invocations, we further propose Tool-Aligned Post-Training (TAPT), which constructs invocation-aligned training units for instruction following and adopts tool-family residual adapters for efficient tool specialization. Experiments show that VLAs-as-Tools improves the success rate of $π_{0.5}$ by 4.8 points on LIBERO-Long and 23.1 points on RoboTwin, and further enhances invocation fidelity by 15.0 points as measured by Non-biased Rate. Code will be released.

2605.13117 2026-05-14 cs.RO cs.AI

SECOND-Grasp: Semantic Contact-guided Dexterous Grasping

Han Yi Shin, Heeju Ko, Jaewon Mun, Qixing Huang, Jaehyeok Lee, Sung June Kim, Honglak Lee, Sujin Jang, Sangpil Kim

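Affiliations: Korea University; University of Texas at Austin; University of Michigan; Hanyang University

AI summary: This paper proposes SECOND-Grasp, a semantically guided dexterous grasping framework that combines physical stability with semantic task understanding for more reliable robotic grasping. The method generates coarse contact regions through vision-language reasoning, improves contact prediction accuracy with semantic-geometric consistency refinement, and finally generates feasible grasp poses via inverse kinematics. Experiments show grasp success rates of 98.2% and 97.7% on seen and unseen object categories, respectively, along with significant gains on intent-aware grasping.

Abstract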

Achieving reliable robotic manipulation, such as dexterous grasping, requires a synergy between physically stable interactions and semantic task guidance, yet these objectives are often treated as separate, disjoint goals. In this paper, we investigate how to integrate dexterous grasping techniques, i.e., physically stable grasps for object lifting and language-guided grasp generation, to achieve both physical stability and semantic understanding. To this end, we propose SECOND-Grasp (SEmantic CONtact-guided Dexterous Grasping), a unified framework that enables robotic hands to dynamically adjust grasping strategies based on semantic reasoning while ensuring physical feasibility. We begin by obtaining coarse contact proposals through vision-language reasoning to infer where contacts should occur based on object properties, followed by segmentation to localize these regions across views. To further ensure consistency across multiple viewpoints, we introduce Semantic-Geometric Consistency Refinement (SGCR), which refines initial contact predictions by enforcing semantic consistency across views and removing geometrically invalid regions, yielding reliable 3D contact maps. Then, we derive a feasible hand pose for each contact map via inverse kinematics, generating a supervision signal for policy learning. Our approach, trained on DexGraspNet, consistently outperforms baselines in lifting success rate on both seen and unseen categories, achieving 98.2% and 97.7%, respectively, while also improving intent-aware grasping by 12.8% and 26.2%. We further show promising results on additional datasets and robotic hands, including Shadow Hand and Allegro Hand.

2605.13111 2026-05-14 cs.CV

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

Jiayu Chen, Junbei Tang, Wenbiao Zhao, Maoliang Li, Jiayi Luo, Zihao Zheng, Jiawei Yang, Guojie Luo, Xiang Chen

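Affiliations: Peking University; South China University of Technology; Xinjiang University; Beihang University; Zhongguancun Academy

AI summary: This paper proposes Pyramid Forcing, a head-aware pyramid KV cache policy for improving high-quality long video generation. By analyzing how different attention heads attend to historical frames, the method identifies three head types with distinct characteristics and designs differentiated cache policies for them, effectively mitigating the degradation caused by long-term error accumulation. Experiments show that the method significantly improves long-horizon video generation quality on multiple metrics.

Abstract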

Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves long-horizon generation quality on VBench-Long, increasing the 60-second Self Forcing score from 77.87 to 81.21 while enhancing motion dynamics, visual fidelity, and semantic consistency. Project: https://if-lab-pku.github.io/Pyramid-Forcing/.
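
A toy version of head-specific retention makes the three head types concrete. The abstract does not specify the actual policies, so everything below is an assumption: Anchor Heads keep frames spread evenly over the history, Wave Heads keep every `stride`-th frame counted back from the newest, and Veil Heads keep the first `sink` frames plus a recent `window`. The function and parameter names are hypothetical.

```python
def retained_frames(head_type, current, *, window=4, stride=4, sink=1, budget=16):
    """Sorted historical frame indices (all < current) that a head of this type keeps."""
    if head_type == "anchor":
        # Broad long-range context: up to `budget` frames spread over the whole history.
        if current <= budget:
            keep = set(range(current))
        else:
            step = current / budget
            keep = {int(i * step) for i in range(budget)}
    elif head_type == "wave":
        # Periodic dependency: every `stride`-th frame, counted back from the newest.
        keep = {f for f in range(current) if (current - 1 - f) % stride == 0}
    elif head_type == "veil":
        # Initial frames plus the most recent neighbours.
        keep = set(range(min(sink, current))) | set(range(max(0, current - window), current))
    else:
        raise ValueError(f"unknown head type: {head_type!r}")
    return sorted(keep)
```

The returned lists differ in length across head types, which is the situation a ragged-cache attention kernel has to handle.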

2605.13108 2026-05-14 cs.CV

Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

Muhammad Shahid Jabbar, Muhammad Sohail Ibrahim, Taha Hasan Masood Siddique, Kejie Huang, Shujaat Khan

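Affiliations: SDAIA-KFUPM Joint Research Center for Artificial Intelligence; King Fahd University of Petroleum & Minerals; Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS); College of Information Science & Electronic Engineering; Department of Computer Engineering, College of Computing and Mathematics

AI summary: This paper studies lightweight face presentation attack detection (FacePAD) under complex attack types and varying capture conditions, proposing a method that combines optical flow augmentation with knowledge distillation. Optical flow is used during training to enhance motion representation, with no flow computation needed at inference; a dual-branch teacher model fuses appearance and motion cues, and knowledge distillation transfers motion-aware knowledge to a lightweight student model, significantly improving detection performance while reducing computational cost. Experiments show strong detection results on multiple benchmark datasets and real-time detection at 52 frames per second on an embedded device.

Comments Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

Abstract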

Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.
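
Logit distillation and the HTER metric reported above can be sketched generically. The loss below is the common temperature-softened KL formulation, used here as an assumption since the abstract does not give the exact loss; the temperature `T` is hypothetical. HTER is the mean of the false acceptance and false rejection rates.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_distillation_loss(student_logits, teacher_logits, T=2.0, eps=1e-12):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return float(T * T * np.mean(kl))

def hter(far, frr):
    """Half total error rate: the average of FAR and FRR."""
    return 0.5 * (far + frr)
```

Only the teacher sees the flow input during training; the student is trained on RGB frames against the teacher's logits, so no flow is computed at inference.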

2605.13105 2026-05-14 cs.RO

What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

Yuanfang Peng, Jingjing Fu, Chuheng Zhang, Li Zhao, Jiang Bian, Mingyu Liu, Ling Zhang, Jun Zhang, Rui Wang

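Affiliations: Hong Kong University of Science and Technology; Microsoft Research Asia; Zhejiang University

AI summary: This work targets the visual shifts that vision-language-action (VLA) models face in robotic manipulation tasks, proposing PAIR-VLA, a reinforcement learning fine-tuning framework. The method adds two auxiliary objectives during PPO optimization, an action invariance objective and an action sensitivity objective, which guide the model to distinguish task-relevant from task-irrelevant visual changes and thereby improve robustness. Experiments show that PAIR-VLA outperforms standard PPO under a variety of out-of-distribution visual shifts, significantly improving generalization and manipulation success rate.

Abstract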

Reinforcement learning (RL) fine-tuning has shown promise for Vision-Language-Action (VLA) models in robotic manipulation, but deployment-time visual shifts pose practical challenges. A key difficulty is that standard task rewards supervise task success, but offer limited guidance on whether a visual change is task-irrelevant or changes the behavior required for manipulation. We propose PAIR-VLA (Paired Action Invariance & Sensitivity for Visually Robust VLA), an RL fine-tuning framework to address this difficulty by adding two auxiliary objectives over paired visual variants during PPO optimization: an invariance term that reduces the discrepancy between action distributions for a task-preserving pair (e.g., different distractors), and a sensitivity objective that encourages separable action distributions for a task-altering pair (e.g., target object in a different pose). Together, these objectives turn visual variants from mere observation diversity into behavior-level guidance on policy responses during RL fine-tuning. We evaluate on ManiSkill3 across two representative VLA architectures, OpenVLA and $π_{0.5}$, under diverse out-of-distribution visual shifts including unseen distractors, texture changes, target object pose variation, viewpoint shifts, and lighting changes. Our method consistently improves over standard PPO, achieving average improvements of 16.62% on $π_{0.5}$ and 9.10% on OpenVLA. Notably, ablations further show generalization across visual shifts: invariance guidance learned from distractor and texture variants transfers to target-pose and lighting shifts, while adding sensitivity guidance on target-pose variants further improves robustness to nuisance shifts, highlighting the broader transferability of behavior-level RL guidance.
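
The two paired objectives can be illustrated on discrete action distributions. This is a sketch under assumptions rather than the PAIR-VLA loss: the discrepancy is taken to be a KL divergence, the sensitivity term is written as a hinge that is zero once the pair is separated by at least `margin`, and the weights `w_inv` and `w_sens` are hypothetical.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def invariance_term(p_orig, p_preserving):
    """Penalizes any change in the action distribution across a task-preserving pair."""
    return kl(p_orig, p_preserving)

def sensitivity_term(p_orig, p_altering, margin=0.5):
    """Penalizes a task-altering pair whose action distributions are not yet separated."""
    return max(0.0, margin - kl(p_orig, p_altering))

def auxiliary_loss(p_orig, p_preserving, p_altering, w_inv=1.0, w_sens=1.0, margin=0.5):
    """Weighted sum of both terms, to be added to the PPO loss."""
    return (w_inv * invariance_term(p_orig, p_preserving)
            + w_sens * sensitivity_term(p_orig, p_altering, margin))
```

The hinge keeps the sensitivity term bounded, so it stops pushing distributions apart once they are clearly distinguishable.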

2605.13101 2026-05-14 cs.LG cs.AI

Margin-calibrated Classifier Guidance for Property-driven Synthesis Planning

Najwa Laabid, Vikas Garg

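Affiliations: Aalto University

AI summary: This work proposes Sequence Completion Ranking (SCR), a new method for improving chemical synthesis route planning based on single-step retrosynthesis models. By introducing contrastive argumentation and a margin-based loss, SCR calibrates the classifier so that it more effectively distinguishes reaction routes that satisfy specific properties during decoding, improving the quality and diversity of generated routes. Experiments show that the method significantly raises multi-step solve rates on USPTO-190 and effectively closes the diversity gap between template-free and template-based methods.

Abstract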

Synthesis planning seeks an efficient sequence of chemical reactions that produce a target molecule. Typically, a pretrained single-step (autoregressive) retrosynthesis model is repeatedly invoked to generate such a sequence. Classifier guidance can, in principle, help steer the output of the single-step model toward reactions that satisfy specific constraints or accommodate chemists' preferences during inference without having to retrain the autoregressive generator. We expose the insufficiency of auxiliary classifiers trained with cross-entropy loss to override the unconditional token-level distributions learned from typical sparse single-disconnection reaction datasets. We overcome this issue with a novel method called Sequence Completion Ranking (SCR), which employs contrastive argumentation and a margin-based loss to calibrate the classifier so that it can meaningfully discriminate between continuations during decoding. We formally establish that margin-calibrated classifiers can expand the set of property-satisfying sequences reachable under guided beam search. Empirically, on USPTO-190, given chemist-specified guidance targets, SCR substantially improves multi-step solve rates from $16.8\%$ (unguided generator) to $78.4\%$ with reaction-type guidance and $95.3\%$ with Tanimoto guidance, unlocking valid routes for 33 targets ($17.4\%$) previously unsolvable with baselines. Our method also effectively closes the long-standing diversity gap between template-free and template-based methods.
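
Two building blocks named in the abstract have standard textbook forms, sketched here for reference. The pairwise hinge below is a generic margin ranking loss and may differ from the SCR objective; fingerprints are represented as sets of on-bit indices, which is an assumption made for this example.

```python
def margin_ranking_loss(score_pos, score_neg, margin=1.0):
    """Zero once the property-satisfying continuation outscores the other by `margin`."""
    return max(0.0, margin - (score_pos - score_neg))

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0
```

Unlike cross-entropy, the hinge directly enforces a score gap between competing continuations, which is the quantity a guided beam search compares.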

2605.13099 2026-05-14 cs.SD

Bypassing Direct Reconstruction: Speech Detection from MEG via Large-Scale Audio Retrieval

Boda Xiao, Bo Wang, Heping Cheng

Affiliations: Center for BioMed-X Research, Academy for Advanced Interdisciplinary Studies, Peking University; Speech and Hearing Research Center, School of Intelligence Science and Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Beijing, China; National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University

AI summary: This paper studies how to detect speech content from non-invasive brain signals (MEG), proposing a new method that does not require direct reconstruction of the speech signal. The method first uses a contrastive learning model to retrieve, from a large-scale audio library, the speech segment that matches the test MEG signal, then uses a speech detection model to generate the binary silence/speech sequence. The method achieved excellent results on the LibriBrain 2025 speech detection task, validating the effectiveness of speech detection using an external audio database.

Comments ranked first at LibriBrain Competition 2025 https://neural-processing-lab.github.io/2025-libribrain-competition/prizes/

Abstract

Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test MEG from a large-scale audio library (LibriVox). Second, a speech detection model generates the binary silence/speech sequence directly from this retrieved audio. With this approach, our team Sherlock Holmes achieved first place in the extended track (F1-score: 0.962), demonstrating that leveraging external audio databases is a highly effective strategy.
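
The retrieval step and the reported metric can be sketched with generic components. The sketch assumes that MEG windows and audio segments have already been embedded into a shared space by some contrastive model, which is not shown, and that retrieval is a nearest-neighbour search by cosine similarity; the function names are hypothetical.

```python
import numpy as np

def retrieve(meg_embedding, audio_embeddings):
    """Index of the audio segment whose embedding is most cosine-similar to the query."""
    q = np.asarray(meg_embedding, dtype=float)
    lib = np.asarray(audio_embeddings, dtype=float)
    q = q / np.linalg.norm(q)
    lib = lib / np.linalg.norm(lib, axis=1, keepdims=True)
    return int(np.argmax(lib @ q))

def f1_score(pred, truth):
    """F1 of a binary speech (1) / silence (0) sequence against the reference."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

Once the correct segment is retrieved, speech detection runs on clean audio rather than on the brain signal, which is why retrieval accuracy dominates the final score.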

2605.13094 2026-05-14 cs.RO

Identification of Non-Transversal Bifurcations of Linkages

Andreas Mueller, P. C. López Custodio, J. S. Dai

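Affiliations: Johannes Kepler University, Linz, Austria; King's College London, UK

AI summary: This paper studies the identification of motion branches of linkages at non-transversal bifurcations, proposing a local analysis method based on the kinematic tangent cone. The constructively defined kinematic tangent cone yields the information needed to distinguish different motion branches, remedying the shortcomings of conventional local analysis in handling non-transversal bifurcations. The paper also presents a computational method that extends an existing algorithmic framework, providing a new tool for studying linkage singularities and mobility.

Comments Paper No: DETC2020-22301, V010T10A090; 8 pages

Journal ref Proceedings of the ASME 2020 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. Volume 10: 44th Mechanisms and Robotics Conference (MR). Virtual, Online

Abstract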

Local analysis is an established approach to the study of singularities and mobility of linkages. The key result of such an analysis is a local picture of the finite motion through a configuration. This reveals the finite mobility at that point and the tangents to smooth motion curves. It does not, however, immediately allow one to distinguish between motion branches that do not intersect transversally (a rather uncommon situation that has only recently been discussed in the literature). The mathematical framework for such a local analysis is the kinematic tangent cone. It is shown in this paper that the constructive definition of the kinematic tangent cone already involves all information necessary to separate different motion branches. A computational method is derived by amending the algorithmic framework reported in previous publications.

2605.13093 2026-05-14 cs.CV

RoSplat: Robust Feed-Forward Pixel-wise Gaussian Splatting for Varying Input Views and High-Resolution Rendering

Hoang Chuong Nguyen, Renjie Wu, Jose M. Alvarez, Miaomiao Liu

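Affiliations: Australian National University; NVIDIA

AI summary: RoSplat is a robust feed-forward pixel-wise Gaussian splatting method that addresses the over-brightness and hole artifacts that arise with varying input views and high-resolution rendering. By introducing a pixel-wise alpha normalization strategy and an auxiliary 3D sampling-based regularizer, it improves the accuracy of Gaussian scale estimation and rendering consistency. Experiments show that RoSplat significantly outperforms existing methods on multiple benchmark datasets, especially under varying input views and in high-resolution settings.

Abstract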

Generalizable 3D Gaussian Splatting has recently emerged as an efficient approach for novel-view synthesis, enabling feed-forward synthesis from only a few input views. However, existing pixel-wise feed-forward methods suffer from over-bright renderings when the number of input views varies during inference, as well as insufficient supervision for accurate Gaussian scale estimation, which leads to hole artifacts, particularly in high-resolution renderings. To address these issues, we identify that the over-brightness is caused by the varying number of overlapping Gaussians and propose a simple alpha normalization strategy to maintain brightness consistency across different numbers of input views. In addition, we introduce an auxiliary 3D sampling-based regularizer to improve Gaussian scale estimation, thereby mitigating hole artifacts in high-resolution rendering. Experiments on benchmark datasets demonstrate that our method significantly improves over baseline models under varying input-view and high-resolution rendering settings.

2605.13088 2026-05-14 cs.LG

Bayesian Nonparametric Mixed-Effect ODEs with Gaussian Processes

Julien Martinelli, Maksim Sinelnikov, Harri Lähdesmäki, Quentin Clairon, Mélanie Prague

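Affiliations: Aalto University; Univ. Bordeaux, INSERM BPH, U1219, Inria SISTM team, VRI, France; Inria SISTM team

AI summary: This paper proposes a Bayesian nonparametric mixed-effect ordinary differential equation (ODE) model for dynamical systems with inter-subject variability. The method decomposes each subject's vector field into a shared population component and a subject-specific deviation, placing Gaussian process priors on both, which increases model flexibility while retaining uncertainty quantification. The work introduces a training method that combines state-space Gaussian process trajectory priors with virtual collocation observations, effectively improving prediction of the population vector field and of individual trajectories.

Abstract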

Dynamical modelling is central to many scientific domains, including pharmacometrics, systems biology, physiology, and epidemiology. In these settings, heterogeneity is often intrinsic: different subjects or units follow related but distinct continuous-time dynamics. Classical nonlinear mixed-effects Ordinary Differential Equation (ODE) models address this by combining population-level structure with subject-specific effects, but they rely on a parametric vector field and are therefore vulnerable to structural misspecification and unmodelled mechanisms. This motivates nonparametric approaches that can retain principled uncertainty quantification, yet existing nonparametric ODE methods typically assume a single shared dynamical system rather than an explicit mixed-effect hierarchy over subject-specific dynamics. We propose MEGPODE, a Bayesian nonparametric mixed-effect ODE model in which each subject's vector field is decomposed into a shared population component and a subject-specific deviation, both endowed with Gaussian process (GP) priors. To avoid repeated ODE solves per subject during training, we combine state-space GP trajectory priors with virtual collocation observations, yielding Kalman-smoothing trajectory updates and closed-form regressions for the vector fields. Across controlled heterogeneous ODE benchmarks spanning oscillatory and biomedical systems, MEGPODE improves population-field recovery and subject-level trajectory prediction relative to strong baselines.
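
The decomposition described above can be written out as follows. The notation is assumed for illustration and is not taken from the paper: $x_i$ is the state of subject $i$, $f$ the shared population field, $g_i$ the subject-specific deviation, $k_f$ and $k_g$ their kernels, and the zero prior means are an assumption.

```latex
\begin{align}
\dot{x}_i(t) &= f\bigl(x_i(t)\bigr) + g_i\bigl(x_i(t)\bigr), \qquad i = 1, \dots, N, \\
f &\sim \mathcal{GP}(0, k_f), \qquad g_i \sim \mathcal{GP}(0, k_g).
\end{align}
```

Read this way, $f$ plays the role of the fixed effect and $g_i$ that of the random effect in a classical nonlinear mixed-effects model, but neither has a parametric form.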

2605.13087 2026-05-14 cs.CL cs.AI

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon, Arghya Bhattacharya, Kumarmanas Nethil

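Affiliations * Adalat AI, India

AI summary This study addresses the fine-tuning of multilingual speech recognition models for low-resource languages and introduces Vividh-ASR, a benchmark that evaluates Hindi and Malayalam recognition across tiers of increasing complexity. By analyzing learning-rate timing and curriculum ordering, it finds that early large parameter updates and a hard-to-easy curriculum improve performance, particularly on spontaneous speech. Based on these findings, the authors propose reverse multi-stage fine-tuning (R-MFT), which lets a parameter-efficient 244M Whisper model match or exceed conventionally fine-tuned 769M models.

Comments Submitted to Interspeech 2026

Details
English abstract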

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.

2605.13086 2026-05-14 cs.RO

Object Manipulation of the Variable Topology Truss system

Andrew Jang-Ho Bae, Myeongjin Choi, Haorui Li, Mark Yim, TaeWon Seo

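Affiliations * RealMan Robotics Co., Ltd.

AI summary This paper proposes an object manipulation strategy for the Variable Topology Truss (VTT) system, which consists of actuated truss members connected by passive spherical joints. To enable manipulation, it introduces a hybrid control framework that regulates position and force concurrently without explicit decoupling. Experiments validate force tracking on a single member module and on the full VTT system, and object manipulation in two representative configurations demonstrates consistent and reliable position and force tracking.

Comments 15 pages, 14 figures

Details
English abstract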

This paper presents an object manipulation strategy for the Variable Topology Truss (VTT) system, a truss robot that comprises actuated truss members connected by passive spherical joints. Although truss robots were originally proposed as rapidly deployable manipulators, manipulation strategy has not been studied thoroughly. To enable manipulation, we introduce a hybrid control framework that regulates position and force concurrently without explicit decoupling. At the actuator level, each member employs a sensor-based force feedback controller to generate the desired axial forces despite high actuator friction. At the task level, the forces applied at the end-effector nodes are produced by computing the required member forces using a static model of the VTT. We evaluate force-tracking performance through experiments on both a single member module and the full VTT system. Finally, we demonstrate object manipulation using two representative configurations and quantitatively assess combined position and force tracking performance. Experimental results confirm that the proposed approach enables consistent and reliable object manipulation with the VTT system.

2605.13083 2026-05-14 cs.RO

TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Jianyi Zhou, Ziteng Gao, Feiyang Hong, Zirui Liu, Guannan Zhang, Weisheng Dai, Ruichen Zhen, Chuqiao Lyu, Haotian Wu, Yinian Mao, Xushi Wang, Yuxiang Jiang, Wenbo Ding, Shuo Yang

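Affiliations * Harbin Institute of Technology, Shenzhen; Meituan Academy of Robotics; Tsinghua Shenzhen International Graduate School, Tsinghua University

AI summary This paper presents TouchAnything, a framework, and EgoTouch, a large-scale dataset, for estimating tactile signals during bimanual object manipulation from egocentric video. To address the lack of tactile signals in existing datasets, the authors use wearable tactile sensors to record synchronized multi-view video, bimanual 3D hand pose, and pressure maps, building a dataset of 208 manipulation tasks. On top of this dataset they design a multi-view vision-to-touch prediction framework, and experiments show that adding wrist-mounted views improves tactile prediction.

Details
English abstract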

Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.

2605.13080 2026-05-14 cs.CV

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun

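Affiliations * NAVER AI Lab

AI summary This paper studies how multimodal large language models can attend to key image regions more efficiently in visual description tasks. The authors propose Gaze Attention, a mechanism that groups visual embeddings into compact gaze regions and dynamically selects task-relevant regions for attention, reducing redundant computation and sharpening focus. To preserve global context, they also introduce learnable context tokens. Experiments show strong results on image and video understanding tasks with substantially fewer visual key-value entries.

Details
English abstract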

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings (stored as key-value caches) into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
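
The region-selection step described in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: contiguous 1-D chunks stand in for spatial regions, the descriptor is assumed to be the mean key of a region, relevance is assumed to be a dot product with the query, and the name `gaze_attention` and its parameters are hypothetical. The learnable context tokens are omitted.

```python
import numpy as np

def gaze_attention(q, keys, values, region_size, top_k):
    """Attend only to the top_k regions whose descriptors best match q."""
    n, d = keys.shape
    n_regions = n // region_size
    # Group cached keys/values into regions (assumption: contiguous chunks).
    k_r = keys[: n_regions * region_size].reshape(n_regions, region_size, d)
    v_r = values[: n_regions * region_size].reshape(n_regions, region_size, -1)
    # Assumed lightweight descriptor: mean key per region.
    descriptors = k_r.mean(axis=1)
    scores = descriptors @ q
    # Indices of the top_k highest-scoring regions, in ascending order.
    chosen = np.sort(np.argsort(scores)[-top_k:])
    k_sel = k_r[chosen].reshape(-1, d)
    v_sel = v_r[chosen].reshape(-1, v_r.shape[-1])
    # Standard scaled dot-product attention over the selected entries only.
    logits = k_sel @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_sel, chosen

# Toy cache: 4 regions of 2 tokens; only region 2 aligns with the query.
keys = np.zeros((8, 2))
keys[4:6, 0] = 3.0
values = np.arange(8, dtype=float).reshape(8, 1)
query = np.array([1.0, 0.0])
out, chosen = gaze_attention(query, keys, values, region_size=2, top_k=1)
```

With `top_k` equal to the number of regions the sketch reduces to dense attention, which makes the saving in attended key-value entries easy to see.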

2605.13079 2026-05-14 cs.LG cs.AI

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, Trung Le

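Affiliations * Hanoi University of Science and Technology; Indiana University; Monash University

AI summary This paper investigates why the Muon optimizer works, showing that its core mechanism is spectral flattening achieved by orthogonalizing the momentum buffer, which raises learning-rate tolerance and speeds convergence. The authors prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, removing a bottleneck of standard gradient descent. Viewing Muon as a preconditioned gradient method, they show that its convergence gain is controlled by the spectrum of the gradient covariance. Experiments show that Muon stays stable at larger learning rates and reaches accuracy targets sooner than standard gradient descent.

Details
English abstract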

Muon orthogonalizes the momentum buffer before each update, replacing its singular values with ones via Newton-Schulz iterations. This simple change lets Muon tolerate far larger learning rates and converge faster than other optimizers, but why? We show that the mechanism is spectral flattening, and develop two results around it. First, we prove that Muon's maximal stable step size scales with the average singular value of the gradient rather than the largest, which bottlenecks standard gradient descent. Second, we recast Muon as a preconditioned gradient method and show, under a Kronecker-factored curvature model, that it improves the effective convergence factor, with the improvement controlled by the spectrum of the gradient covariance. Extensive experiments validate both results: Muon remains stable at learning rates that cause SGD to diverge within the first few iterations, and reaches accuracy milestones several epochs earlier even at identical step sizes. Taken together, our results offer a principled, geometric explanation for Muon's empirical success.
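
The orthogonalization step described in this abstract can be sketched as follows. This is a minimal NumPy sketch, not Muon's or the paper's implementation: it uses the classic cubic Newton-Schulz iteration rather than any tuned polynomial a production optimizer may use, and the function name `newton_schulz_orthogonalize` and the `steps` default are assumptions chosen for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(m, steps=20):
    """Push every nonzero singular value of m toward 1 (classic cubic iteration)."""
    # Frobenius normalization keeps all singular values in (0, 1],
    # inside the convergence region of the iteration.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Acts on each singular value s as s -> 1.5*s - 0.5*s**3,
        # leaving the singular vectors unchanged.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# Stand-in for a momentum buffer with unequal singular values.
m = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
o = newton_schulz_orthogonalize(m)
print(np.linalg.svd(m, compute_uv=False))  # spectrum before flattening
print(np.linalg.svd(o, compute_uv=False))  # spectrum after flattening
```

The result approximates the factor $UV^\top$ from the singular value decomposition $M = U \Sigma V^\top$: the directions of the update are kept while its singular values are replaced with ones.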

2605.13076 2026-05-14 cs.CL cs.FL cs.SE

TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints

Yoshio Kato, Shuhei Tarashima

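Affiliations * NTT DOCOMO BUSINESS, Inc., Japan

AI summary TruncProof is a grammar-constrained generation method for producing syntactically valid JSON under a token-length limit. It uses properties of LL(1) parsers to efficiently estimate, during decoding, the minimum number of tokens needed to complete a valid JSON output, so that the result is grammatically valid and stays within the preset limit. Experiments show that TruncProof still produces syntactically correct JSON under strict token constraints and can be combined with advanced decoding strategies to improve semantic accuracy.

Comments Main paper (8 pages). Accepted at the International Joint Conference on Neural Networks (IJCNN 2026)

Details
English abstract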

The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with advanced decoding strategies, resulting in outputs that are not only grammatically valid but also semantically accurate.
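
The budget check described in this abstract can be illustrated with a much simpler stand-in. The sketch below works on characters rather than tokens and does not use an LL(1) parser: it only counts the closers needed for open strings, arrays, and objects, which gives a lower bound on the completion cost rather than the paper's estimate. The names `closing_cost` and `may_extend` are hypothetical.

```python
def closing_cost(prefix: str) -> int:
    """Lower bound on characters needed to close a partial JSON prefix."""
    stack = []          # currently open containers: "{" or "["
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]" and stack:
            stack.pop()
    # One closer per open container, plus a quote for an open string.
    return len(stack) + (1 if in_string else 0)

def may_extend(prefix: str, candidate: str, budget: int) -> bool:
    """Accept candidate only if the output can still be closed within budget."""
    extended = prefix + candidate
    return len(extended) + closing_cost(extended) <= budget
```

A decoder using such a guard would reject any continuation that leaves too little room to close the structure, which is the idea behind enforcing a hard length limit without truncating the output mid-structure.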

2605.13068 2026-05-14 cs.LG

Local Inverse Geometry Can Be Amortized

Aaditya L. Kachhadiya

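Affiliations * Independent Researcher

AI summary This paper studies learning local inverse geometry for nonlinear inverse problems, proposing a framework that replaces curvature-aware optimization with a learned, reusable reverse operator. It builds a bidirectional surrogate, Deceptron, and pairs it with the iterative solver D-IPG, using a Jacobian Composition Penalty (JCP) to train the reverse Jacobian to approximate a local left inverse of the forward Jacobian. Experiments on several PDE inverse-problem benchmarks show that the method outperforms standard baselines, with lower solve cost and comparable or better recovery quality.

Comments Preprint. 21 pages, 8 figures, 8 tables. Code available at https://github.com/AadityaKachhadiya/deceptron

Details
English abstract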

Nonlinear inverse problems often trade inexpensive but fragile first-order updates against curvature-aware methods such as Gauss-Newton and Levenberg-Marquardt, which obtain stronger directions by repeatedly solving Jacobian-based linearized systems. We propose a learned alternative: amortize local inverse geometry into a reusable reverse operator. Our framework learns a bidirectional surrogate, Deceptron, and deploys it through D-IPG (Deceptron Inverse-Preconditioned Gradient), an iterative solver that pulls residual-corrected measurement-space proposals back to latent space. The key mechanism is a Jacobian Composition Penalty (JCP), which trains the reverse Jacobian to act as a local left inverse of the forward Jacobian; its runtime counterpart, RJCP, measures the same inverse-consistency error along optimization trajectories. We prove that D-IPG is first-order equivalent to damped Gauss-Newton under local pseudoinverse consistency, with deviation controlled by composition error and conditioning. Across seven PDE inverse-problem benchmarks, D-IPG outperforms standard baselines, achieves 94.8% mean success across the six-problem reliability suite, and reaches comparable or better recovery quality at up to 77x lower inference-time solve cost on the main benchmarks.

2605.13067 2026-05-14 cs.RO cs.AI

When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation

Maxime Alvarez, Ryo Watanabe, Paul Crook, Afshin Zeinaddini Meymand, Suvin Kurian, Pablo Ferreiro, Genki Sano

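Affiliations * TELEXISTENCE Inc, Foundation Model Division; The University of Tokyo

AI summary As end-to-end robot policies see wider real-world use, the gap between training and inference conditions becomes a major challenge. This paper studies how the encoding of the robot's proprioceptive state affects in-distribution and out-of-distribution performance, especially robustness to unseen test conditions. It finds that an episode-wise relative frame outperforms the baselines in real-robot experiments, offering a practical path to using data collected under different frames of reference to improve generalization.

Comments Accepted to ICRA 2026 Workshop: From Data to Decisions

Details
English abstract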

As end-to-end robotic policies are progressively deployed in the real world to solve real tasks, they face a gap between the training and inference conditions. Scaling the amount and diversity of the training data has shown some success in improving zero-shot generalization, yet robots still fail when faced with new, unseen test conditions. For instance, while robots with fixed frames of reference are common, those with moving frames pose a greater challenge for deployment. To address this specific instance of the issue, we present a study of strategies for encoding the robot's proprioceptive state to improve both in- and out-of-distribution performance at test time. Through a systematic study of joint representations, we find that a simple episode-wise relative frame provides the best trade-off between task performance and robustness, outperforming the baselines in extensive real-robot experiments conducted in a realistic test environment. The results suggest a practical path to leveraging data collected by robots with varying frames of reference and deployment to unseen test configurations.
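
One way to read the "episode-wise relative frame" in this abstract is to express every proprioceptive reading relative to the first reading of its episode. The sketch below makes that assumption explicit; the abstract does not spell out the encoding, the function name `episode_relative` is hypothetical, and only translational offsets are handled (a rotated base frame would need a proper rigid transform).

```python
import numpy as np

def episode_relative(states):
    """Encode each state relative to the first state of the same episode."""
    states = np.asarray(states, dtype=float)
    return states - states[0]

# The same motion recorded from two different absolute base positions.
motion = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]])
episode_a = motion + np.array([1.0, 2.0])
episode_b = motion + np.array([-3.0, 0.5])

rel_a = episode_relative(episode_a)
rel_b = episode_relative(episode_b)
```

Both episodes map to the same encoding, so a policy trained on one absolute frame is not thrown off when the frame moves at test time.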

2605.13063 2026-05-14 cs.LG

Ergodic Trajectory Design by Learned Pushforward Maps: Provable Coverage via Conditional Flow Matching

Ehsan Aghazadeh, Masoud Malekzadeh, Ahmad Ghasemi, Hossein Pishro-Nik

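Affiliations * University of Massachusetts Amherst

AI summary This paper studies how to design continuous trajectories whose time-averaged occupancy provably matches a prescribed spatial density, the ergodic coverage problem, which matters for UAV data collection, robotic exploration, and mobile monitoring. The authors propose a pushforward framework that decouples ergodicity from density matching: a map learned offline by optimal-transport conditional flow matching transports a uniformly ergodic trajectory on a simple annular domain onto the target density. Once trained, the map serves an unbounded number of trajectories and multi-agent fleets, accommodates many differentiable operational constraints, and comes with provable coverage guarantees.

Details
English abstract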

Designing continuous trajectories whose time-averaged occupancy provably matches a prescribed spatial density (the ergodic coverage problem) is central to UAV-assisted data collection and sensing, robotic exploration, and mobile monitoring. For flying agents in particular, this challenge is acute: trajectories must balance coverage fidelity against tight energy budgets, no-fly zones, and acceleration limits. Existing methods either re-optimize each trajectory online (with cost growing in the horizon and re-running for every target, agent, and realization) or rely on bespoke analytical constructions that must be re-derived for each new constraint. We propose a pushforward framework that decouples ergodicity from density matching: an analytic latent trajectory provides exact uniform ergodicity on a simple annular domain, and a single map, learned offline by optimal-transport conditional flow matching, transports this latent occupancy onto the prescribed target density. The composed trajectory is then asymptotically ergodic with respect to the learned pushforward distribution, with deviation from the target controlled by the flow-matching training loss. Once trained for a given target density and constraint set, the map serves an unbounded number of trajectories and a multi-agent fleet without per-agent retraining, and many differentiable operational constraints (no-fly zones, acceleration ceilings, or fairness penalties) enter as additive soft penalties in the training loss without re-deriving the design. We prove three results (an acceleration-energy bound, an $O(1/\sqrt{K})$ ergodic convergence rate in the number of trajectory cycles $K$, and an approximation-error bound) that combine into an end-to-end coverage bound estimable from CFM training diagnostics (certified given an architectural Lipschitz bound on $v_\theta$).

2605.13062 2026-05-14 cs.CV

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, Yuanxing Zhang

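Affiliations * HDU; PKU; Kling Team; CASIA

AI summary Image editing models have recently made marked progress in instruction following, multimodal understanding, and complex visual editing, but existing benchmarks often fail to reflect human judgment, particularly for frontier models, because task difficulty is limited and evaluation is coarse-grained. To address this, the paper introduces Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances across six progressively harder task categories and uses a fine-grained multidimensional evaluation framework. EditReward-Compass contains 2,251 preference pairs that simulate reward modeling in practical RL settings.

Details
English abstract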

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

2605.13059 2026-05-14 cs.CV

BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability

Guangqian Yang, Tong Ding, Wenlong Hou, Yue Xun, Ye Du, Qian Niu, Shujun Wang

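Affiliations * Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China; Department of Technology Management for Innovation, The University of Tokyo, Japan; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China

AI summary This paper presents BrainAnytime, a unified pretraining framework for brain image analysis under arbitrary modality availability. Within a shared 3D masked autoencoder, it learns structural-molecular correspondences between MRI and PET through cross-modal distillation and focuses on disease-vulnerable anatomy through atlas-guided curriculum masking. Experiments show that BrainAnytime outperforms existing models in most clinical modality settings, with notable gains in average accuracy on Alzheimer's disease classification.

Comments Early accepted by MICCAI 2026

Details
English abstract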

Clinical diagnostic workups typically follow a modality escalation pathway: after initial clinical evaluation, clinicians begin with routine structural imaging (e.g., MRI), selectively add sequences such as FLAIR or T2 to refine the differential, and reserve molecular imaging (e.g., amyloid-PET) for cases that remain uncertain after standard evaluation. Consequently, patients are observed with heterogeneous and often incomplete modality subsets. However, most current AI models assume fixed data modalities as the model inputs. In this paper, we present BrainAnytime, a unified pretraining framework pretrained on 34,899 3D brain scans from five datasets that supports brain image analysis under arbitrary modality availability spanning multi-sequence MRI and amyloid-PET. A single model accepts whatever imaging is available, from a lone T1 scan to a full multimodal workup. Pretraining learns structural-molecular correspondences between MRI and PET via cross-modal distillation (RCMD) and prioritizes disease-vulnerable anatomy via atlas-guided curriculum masking (PACM), all within a shared 3D masked autoencoder (Multi-MAE3D). Across four downstream tasks and five clinically motivated modality settings, BrainAnytime outperforms modality-specific models, missing-modality baselines, and large-scale brain MRI pretrained foundation models on most modality settings. Notably, it surpasses the strongest missing-modality baselines with relative improvements of 6.2% and 7.0% in average accuracy on CN vs. AD and CN vs. MCI classification, respectively. Code is available at https://github.com/SDH-Lab/BrainAnytime.

2605.13058 2026-05-14 cs.RO

MUJICA: Multi-skill Unified Joint Integration of Control Architecture for Wheeled-Legged Robots

Yuqi Li, Peng Zhai, Yueqi Zhang, Xiaoyi Wei, Quancheng Qian, Zhengxu He, Qianxiang Yu, Lihua Zhang

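Affiliations * College of Intelligent Robotics and Advanced Manufacturing, Fudan University; Power China Huadong Engineering Corporation Limited

AI summary This paper proposes MUJICA, a unified control architecture for wheeled-legged robots that addresses the coordination of wheeled driving and legged control on complex terrain. A single policy integrates several low-level skills, such as omnidirectional movement, high-platform climbing, and fall recovery, trained jointly with accurate DC-motor constraint modeling, and a proprioception-based high-level skill selector enables adaptive responses to the environment. Experiments show that MUJICA markedly improves adaptability and task success in unstructured environments.

Details
English abstract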

Wheeled-legged robots hold promise for traversing complex terrains and offer superior mobility compared to legged robots. However, wheeled-legged robots must effectively balance both wheeled driving and legged control. Furthermore, due to noisy proprioceptive sensing and real-world motor constraints, realizing robust and adaptive locomotion at peak motor performance remains challenging. We propose the Multi-skill Unified Joint Integration of Control Architecture (MUJICA), a unified, fully proprioceptive control framework for wheeled-legged robots that integrates diverse low-level skills (including omnidirectional moving, high platform climbing, and fall recovery) within a single policy. All skills, distinguished by unique indicator variables, are trained jointly with accurate DC-motor constraint modeling. Additionally, a high-level skill selector is learned to dynamically choose the optimal skill based solely on proprioception, enabling adaptive responses to the surrounding environment. Therefore, MUJICA enhances sim-to-real robustness and enables seamless transitions across diverse locomotion modes, facilitating autonomous adjustment to the environment. We validate our framework in both simulation and real-world experiments on the Unitree Go2-W robot, demonstrating significant improvements in adaptability and task success in unstructured environments.

2605.13055 2026-05-14 cs.CL cs.CY

The Cost of Perfect English: Pragmatic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI

Ao Liu, Shanhua Zhu

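Affiliations * School of Foreign Languages, Southeast University

AI summary This study examines "pragmatic flattening", the systematic erasure of culturally specific politeness and authorial stance, which can arise when generative AI (GenAI) assists non-native writers such as Chinese B2-level university students. Comparing argumentative essays before and after GenAI polishing, it finds that the models handle grammar and meaning well but diverge on pragmatic dimensions such as dialogic engagement and epistemic stance, so that authors' distinctive voices are replaced by homogenized English. The study argues for Critical AI Literacy education that helps multilingual writers use GenAI to improve their language while preserving pragmatic diversity and rhetorical agency.

Comments 16 pages, 2 figures

Details
English abstract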

The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating "pragmatic flattening" - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of argumentative essays by Chinese B2-level university students from the ICNALE corpus. The original texts were polished via the APIs of four leading Large Language Models at a zero-temperature setting for reproducibility. Findings reveal a nuanced "dimensional divergence" within the Semantic Preservation Paradox. While models corrected lexicogrammatical errors and retained propositional meaning, sociopragmatic interventions were bifurcated. In the interactive dimension, all models showed a drastic collapse of dialogic engagement markers, turning negotiated discourse into monologic assertions. Conversely, in the epistemic stance dimension, models showed architecture-based variability: some aggressively scrubbed epistemic markers, while others reinforced tentative hedging as decontextualized algorithmic caution. This confirms that while GenAI enhances accuracy, it systematically overwrites L2 writers' unique rhetorical identities into a homogenized Anglo-American paradigm. We argue that future instruction must move beyond error correction, advocating for Critical AI Literacy to empower multilingual writers to use GenAI for linguistic enhancement while safeguarding sociopragmatic diversity and rhetorical agency.