arXivDaily: daily arXiv academic digest, updated Monday through Friday
2510.10642 2026-05-14 cs.RO cs.AI

UniJEPA: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

Jianke Zhang, Yucheng Hu, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, Jianyu Chen

Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute, Shanghai, China; Peking University, Beijing, China; Shanghai AI Lab, Shanghai, China

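AI summary: This paper proposes UniJEPA, a new robot policy learning framework intended to improve robots' ability to handle diverse tasks in open-ended environments. The method learns continuous and discrete visual representations in a unified way and combines large-scale pretraining with fine-tuning on data from the robot embodiment, which lets it model the dynamics of high-dimensional visual features and learn a mapping from predictive representations to actions. Experiments show that UniJEPA outperforms existing baselines both in simulation and on real-world out-of-distribution tasks.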

Journal ref ICML 2026

Abstract

Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work (VLA) has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. However, both semantic understanding from vision-language pretraining and visual dynamics modeling from visual-generation pretraining are crucial for embodied robots. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We posit that robotic policy learning can likewise benefit from the combined strengths of understanding, planning, and continuous future representation learning. Building on this insight, we introduce UniJEPA, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos. Subsequently, UniJEPA is fine-tuned on data collected from the robot embodiment, enabling the learning of mappings from predictive representations to action tokens. Extensive experiments show that our approach consistently outperforms baseline methods by 9% and 12% across simulation environments and real-world out-of-distribution tasks.

2510.10265 2026-05-14 cs.CL

Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Liang Lin, Miao Yu, Moayad Aloqaily, Zhenhong Zhou, Kun Wang, Linsey Pang, Prakhar Mehrotra, Qingsong Wen

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; University of Science and Technology of China; United Arab Emirates University; PayPal Inc; Walmart Labs; Squirrel Ai Learning; Nanyang Technological University

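AI summary: This work addresses backdoor attacks on large language models and proposes Backdoor Collapse, a defense that needs no prior knowledge of trigger settings. The core idea is to inject known backdoor triggers so that the existing unknown backdoors and the newly injected ones aggregate in the representation space, and then to remove the backdoor effect through recovery fine-tuning. Experiments show that the method substantially lowers the attack success rate on multiple benchmarks while preserving the model's clean accuracy and utility, and that it generalizes well.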

Abstract

Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose a defense framework that requires no prior knowledge of trigger settings. Our method is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. Our method leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then, performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1% to 69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.

2510.08992 2026-05-14 cs.LG

Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search

Kamel Alrashedy, Vriksha Srihari, Zulfiqar Zaidi, Ridam Srivastava, Pradyumna Tambwekar, Matthew Gombolay

Affiliations: Georgia Institute of Technology

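AI summary: This work proposes Constraints-of-Thought (Const-o-T), a framework aimed at the difficulty large language models have in satisfying high-level user intent and symbolic constraints during multi-step planning. By representing each reasoning step as an (intent, constraint) pair, the framework gives Monte Carlo Tree Search (MCTS) a structured prior that compresses the search space and keeps paths semantically valid. Experiments show that Const-o-T outperforms existing methods in the Risk game, CAD code generation, and arithmetic reasoning, improving planning efficiency and constraint alignment.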

Abstract

While researchers have made significant progress in enabling large language models (LLMs) to perform multi-step planning, LLMs struggle to ensure that those plans align with high-level user intent and satisfy symbolic constraints, especially in complex, multi-step domains. Existing reasoning approaches, such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and verifier-augmented methods, expand the search space but often yield infeasible actions or hallucinated steps. To overcome these limitations, we propose Constraints-of-Thought (Const-o-T), a framework that provides a structured prior that enables Monte Carlo Tree Search (MCTS) to focus search on semantically meaningful paths. Each reasoning step is represented as an (intent, constraint) pair, which serves both to compress the search space and enforce validity. Unlike prior methods that merely generate reasoning traces or validate outputs post hoc, Const-o-T uses (intent, constraint) pairs to actively focus the search toward feasible and meaningful plans. We integrate Const-o-T into MCTS using a structured representation of intent-constraint pairs: constraints prune infeasible branches and guide exploration toward semantically valid actions, improving planning efficiency and verifiable decision-making. We demonstrate across three domains (Risk game, CAD code generation, and arithmetic reasoning) that our approach outperforms baselines, yielding higher accuracy and stronger structural alignment. Our contribution is to demonstrate that Const-o-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.
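As a rough illustration of how an (intent, constraint) pair could prune candidate actions before a search expands them, consider the sketch below. The `Step` type, the `prune` helper, and the toy integer domain are hypothetical and chosen only for this example; they are not the authors' implementation, and no MCTS is included.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """Hypothetical container for one reasoning step."""
    intent: str                        # what the step is trying to achieve
    constraint: Callable[[int], bool]  # predicate a candidate action must satisfy


def prune(candidates: List[int], step: Step) -> List[int]:
    """Keep only the candidates that satisfy the step's constraint."""
    return [c for c in candidates if step.constraint(c)]


# Toy domain (an assumption): candidate actions are plain integers.
step = Step(
    intent="pick an even number below 10",
    constraint=lambda x: x % 2 == 0 and x < 10,
)
feasible = prune([3, 4, 8, 12, 15], step)
print(feasible)  # [4, 8]
```

In a tree search, a filter of this kind would run at expansion time, so infeasible branches are never added to the tree instead of being rejected after a full rollout.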

2510.03548 2026-05-14 cs.CV cs.AI

Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

Affiliations: Drexel University; NVIDIA

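AI summary: This paper studies impersonation attacks on AI-based videoconferencing systems, in which an attacker manipulates the transmitted latent to hijack a user's likeness in real time. As a defense, the authors exploit the biometric information inherent in the latent and design a pose-conditioned contrastive encoder that isolates identity cues while cancelling pose and expression, so impersonation can be detected without relying on the reconstructed video. Experiments on multiple generation models show superior detection performance, real-time operation, and good generalization.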

Abstract

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
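The "simple cosine test" mentioned in the abstract can be pictured with the minimal sketch below. It assumes that an identity embedding has already been extracted from the transmitted latent and that a reference embedding of the legitimate user is available; the function names and the 0.5 threshold are hypothetical, and the encoder itself is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_puppeteered(reference: Sequence[float],
                   observed: Sequence[float],
                   threshold: float = 0.5) -> bool:
    """Flag a frame when the observed identity embedding drifts from the reference.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(reference, observed) < threshold


same_person = is_puppeteered([1.0, 0.0], [1.0, 0.1])   # nearly parallel vectors
other_person = is_puppeteered([1.0, 0.0], [0.0, 1.0])  # orthogonal vectors
print(same_person, other_person)  # False True
```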

2510.00929 2026-05-14 cs.CV

Equivariant Splitting: Self-supervised learning from incomplete data

Victor Sechaud, Jérémy Scanvic, Quentin Barthélemy, Patrice Abry, Julián Tachella

Affiliations: LPENSL, CNRS, ENS de Lyon, France; Prysm, Lyon, France

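AI summary: This paper proposes Equivariant Splitting, a new self-supervised learning method for incomplete data that targets reconstruction when only a single incomplete observation model is available. It introduces a notion of equivariance for reconstruction networks and combines it with self-supervised splitting losses to obtain an unbiased estimate of the supervised loss. Experiments show strong results on image inpainting, accelerated magnetic resonance imaging, sparse-view CT, and compressive sensing, particularly when the forward model is highly rank-deficient.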

Abstract

Self-supervised learning for inverse problems makes it possible to train a reconstruction network from noisy and/or incomplete data alone. These methods have the potential to enable learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, sparse-view computed tomography, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models. The code is available at https://github.com/vsechaud/Equivariant-Splitting

2509.25781 2026-05-14 cs.AI cs.LO

Deontic Argumentation

Guido Governatori, Antonino Rotolo

Affiliations: School of Engineering and Technology, Central Queensland University; Alma AI and Department of Legal Studies, University of Bologna

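AI summary: This paper studies how to define a semantics for deontic argumentation that supports weak permission. The authors point out that current approaches based on grounded semantics cannot support weak permission when obligations conflict, and they propose a new deontic argumentation theory that handles weak permission correctly, strengthening the semantic foundations of deontic argumentation.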

Abstract

We address the issue of defining a semantics for deontic argumentation that supports weak permission. Some recent results show that grounded semantics do not support weak permission when there is a conflict between two obligations. We provide a definition of Deontic Argumentation Theory that accounts for weak permission, and we recall the result about grounded semantics. Then, we propose a new semantics that supports weak permission.

2509.24728 2026-05-14 cs.LG stat.ML

Beyond Softmax: A Natural Parameterization for Categorical Random Variables

Alessandro Manenti, Cesare Alippi

Affiliations: Università della Svizzera italiana, IDSIA

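AI summary: This paper proposes the catnat function as an alternative to the conventional softmax for categorical random variables. Taking an information-geometric view, the authors expose limitations of softmax and construct catnat from hierarchical binary splits, which yields a diagonal Fisher Information Matrix and makes gradient descent more efficient. Experiments show that catnat improves learning efficiency and model performance on graph structure learning, variational autoencoders, and reinforcement learning, and that it is easy to implement and compatible with existing training techniques.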

Abstract

Latent categorical variables are frequently found in deep learning architectures. They can model actions in discrete reinforcement-learning environments, represent categories in latent-variable models, or express relations in graph neural networks. Despite their widespread use, their discrete nature poses significant challenges to gradient-descent learning algorithms. While a substantial body of work has offered improved gradient estimation techniques, we take a complementary approach. Specifically, we: 1) revisit the ubiquitous softmax function and demonstrate its limitations from an information-geometric perspective; 2) replace the softmax with the catnat function, a function composed of a sequence of hierarchical binary splits; we prove that this choice offers significant advantages to gradient descent due to the resulting diagonal Fisher Information Matrix. A rich set of experiments - including graph structure learning, variational autoencoders, and reinforcement learning - empirically show that the proposed function improves the learning efficiency and yields models characterized by consistently higher test performance. Catnat is simple to implement and seamlessly integrates into existing codebases. Moreover, it remains compatible with standard training stabilization techniques and, as such, offers a better alternative to the softmax function.
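The abstract does not spell out the catnat construction, so the sketch below only illustrates the general idea of building categorical probabilities from hierarchical binary splits. It assumes a balanced binary tree with one sigmoid per internal node and a power-of-two number of classes; the function name and the tree layout are illustrative choices and may differ from the paper's definition.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def binary_split_probs(logits: List[float]) -> List[float]:
    """Map K-1 logits to K class probabilities through a balanced binary tree.

    Internal nodes are visited level by level; each node sends a fraction
    sigmoid(logit) of its probability mass left and the remainder right.
    K = len(logits) + 1 must be a power of two (an assumption of this sketch).
    """
    k = len(logits) + 1
    if k & (k - 1) != 0:
        raise ValueError("number of classes must be a power of two")
    probs = [1.0]  # all mass starts at the root
    node = 0
    while len(probs) < k:
        next_level = []
        for mass in probs:
            p_left = sigmoid(logits[node])
            next_level.extend([mass * p_left, mass * (1.0 - p_left)])
            node += 1
        probs = next_level
    return probs


uniform = binary_split_probs([0.0, 0.0, 0.0])  # every split is 50/50
skewed = binary_split_probs([1.0, -2.0, 0.5])
print(uniform)      # [0.25, 0.25, 0.25, 0.25]
print(sum(skewed))  # sums to 1 up to floating-point rounding
```

Unlike softmax, where every logit affects every class probability, each logit here controls a single binary decision.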

2509.23597 2026-05-14 cs.LG cs.AI

Characteristic Root Analysis and Regularization for Linear Time Series Forecasting

Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li, Tobias Schlagenhauf

Affiliations: Bosch Center for AI (BCAI) & Bosch (China) Investment Co., Ltd.; Robert Bosch GmbH

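AI summary: This paper presents a systematic study of linear models for time series forecasting, focusing on the role of characteristic roots in temporal dynamics and showing that models tend to produce spurious roots in noisy settings. The authors propose two complementary regularization strategies: one recovers the latent dynamics with low-rank regression techniques, and the other, a new method called Root Purge, encourages the model to learn a noise-suppressing null space. Experiments on multiple benchmark datasets support the theoretical analysis and reach state-of-the-art results in some settings.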

Comments Accepted for publication at ICLR 2026

Abstract

Time series forecasting remains a critical challenge across numerous domains, yet the effectiveness of complex models often varies unpredictably across datasets. Recent studies highlight the surprising competitiveness of simple linear models, suggesting that their robustness and interpretability warrant deeper theoretical investigation. This paper presents a systematic study of linear models for time series forecasting, with a focus on the role of characteristic roots in temporal dynamics. We begin by analyzing the noise-free setting, where we show that characteristic roots govern long-term behavior and explain how design choices such as instance normalization and channel independence affect model capabilities. We then extend our analysis to the noisy regime, revealing that models tend to produce spurious roots. This leads to the identification of a key data-scaling property: mitigating the influence of noise requires disproportionately large training data, highlighting the need for structural regularization. To address these challenges, we propose two complementary strategies for robust root restructuring. The first uses rank reduction techniques, including Reduced-Rank Regression (RRR) and Direct Weight Rank Reduction (DWRR), to recover the low-dimensional latent dynamics. The second, a novel adaptive method called Root Purge, encourages the model to learn a noise-suppressing null space during training. Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings. Our findings underscore the potential of integrating classical theories for linear systems with modern learning techniques to build robust, interpretable, and data-efficient forecasting models. The code is publicly available at: https://github.com/Wangzzzzzzzz/RootPurge.
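To make the role of characteristic roots concrete, the sketch below works through the textbook case of a second-order linear recursion x_t = a1*x_{t-1} + a2*x_{t-2}, whose characteristic polynomial is z^2 - a1*z - a2. The function names are hypothetical, and the example covers only this standard noise-free case, not the paper's analysis or its regularizers.

```python
import cmath
from typing import Tuple


def ar2_characteristic_roots(a1: float, a2: float) -> Tuple[complex, complex]:
    """Roots of z**2 - a1*z - a2 = 0 for x_t = a1*x_{t-1} + a2*x_{t-2}."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def describe(root: complex, tol: float = 1e-9) -> str:
    """Classify the long-term behavior of the mode associated with a root."""
    modulus = abs(root)
    if modulus < 1.0 - tol:
        return "decaying"
    if modulus > 1.0 + tol:
        return "growing"
    return "persistent"


# z**2 - 1.5*z + 0.5 factors as (z - 1)(z - 0.5).
roots = ar2_characteristic_roots(1.5, -0.5)
labels = sorted(describe(r) for r in roots)
print(roots)   # ((1+0j), (0.5+0j))
print(labels)  # ['decaying', 'persistent']
```

A root of modulus 1 keeps its mode alive indefinitely (a level or a pure oscillation), while a root inside the unit circle contributes a transient that fades with the forecast horizon.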

2509.23056 2026-05-14 cs.CV cs.LG

FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection

Ben Liang, Hongguang Wei, Yuan Liu, Bingwen Qiu, Yihong Wang, Xiubao Sui, Qian Chen

Affiliations: School of Electronic Engineering and Optoelectronic Technology, Nanjing University of Science and Technology

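AI summary: This paper proposes FMC-DETR, a frequency-decoupled fusion framework for aerial-view object detection in remote sensing imagery, targeting the weak visual cues and insufficient global context modeling that make tiny objects hard to detect in high-resolution images. The method uses the Wavelet Kolmogorov-Arnold Transformer (WeKat) as its backbone, combining wavelet transforms and Kolmogorov-Arnold networks to strengthen global low-frequency structure perception in shallow features and nonlinear modeling of multi-scale dependencies. It also introduces the Multi-Domain Feature Coordination (MDFC) module to refine cross-scale feature fusion and the Compact Partial Fusion (CPF) module to improve small-object detection. Experiments show state-of-the-art detection results on multiple remote sensing benchmarks.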

Abstract

Remote sensing object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based rescue. Detecting tiny objects in high-resolution aerial imagery remains challenging due to weak visual cues and insufficient global context modeling in complex scenes. Existing methods often suffer from delayed contextual interaction and limited nonlinear reasoning, which restrict their ability to effectively refine shallow representations and ultimately lead to suboptimal performance. To address these challenges, we propose FMC-DETR, a frequency-decoupled fusion framework for aerial-view object detection. First, we propose the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which employs cascaded wavelet transforms to enhance global low-frequency structure perception in shallow features while preserving fine-grained details, and further leverages Kolmogorov-Arnold networks for adaptive nonlinear modeling of multi-scale dependencies. Second, we introduce the Multi-Domain Feature Coordination (MDFC) module, which refines cross-scale fused representations through partial-channel spatial, spectral, and structural coordination, thereby strengthening small-object-related feature responses in cluttered scenes. Finally, we design the Compact Partial Fusion (CPF) module, which performs compact multi-branch aggregation with progressive partial refinement to improve feature diversity and multi-scale interaction while preserving stable information flow and reducing redundant perturbation. Extensive experiments across multiple remote sensing benchmarks demonstrate that FMC-DETR achieves state-of-the-art performance and significantly outperforms the baseline detector. Code is available at https://github.com/bloomingvision/FMC-DETR.

2509.19538 2026-05-14 cs.LG cs.AI

DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions

Zongyue Li, Xiao Han, Yusong Li, Niklas Strauss, Matthias Schubert

Affiliations: Department of Computer Science, University of Munich, Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany

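AI summary: DAWM is a diffusion-based world model intended to improve offline reinforcement learning. It generates future state-reward trajectories conditioned on the current state, action, and return-to-go, and pairs them with an inverse dynamics model for efficient action inference, producing complete synthetic transitions suitable for offline RL algorithms based on one-step temporal difference learning. Experiments show that DAWM substantially improves conservative offline RL algorithms such as TD3BC and IQL on the D4RL benchmark and outperforms existing diffusion-based baselines.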

Comments ICML2025 workshop Building Physically Plausible World Models

Abstract

Diffusion-based world models have demonstrated strong capabilities in synthesizing realistic long-horizon trajectories for offline reinforcement learning (RL). However, many existing methods do not directly generate actions alongside states and rewards, limiting their compatibility with standard value-based offline RL algorithms that rely on one-step temporal difference (TD) learning. While prior work has explored joint modeling of states, rewards, and actions to address this issue, such formulations often lead to increased training complexity and reduced performance in practice. We propose DAWM, a diffusion-based world model that generates future state-reward trajectories conditioned on the current state, action, and return-to-go, paired with an inverse dynamics model (IDM) for efficient action inference. This modular design produces complete synthetic transitions suitable for one-step TD-based offline RL, enabling effective and computationally efficient training. Empirically, we show that conservative offline RL algorithms such as TD3BC and IQL benefit significantly from training on these augmented trajectories, consistently outperforming prior diffusion-based baselines across multiple tasks in the D4RL benchmark.
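The modular design described above, where a generated state-reward trajectory is completed with inferred actions, can be pictured with the sketch below. It assumes the trajectory has already been generated and shows only the pairing step; `build_transitions` and `toy_idm` are hypothetical stand-ins, and neither the diffusion model nor a learned IDM is implemented.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Transition = Tuple[State, float, float, State]  # (s, a, r, s_next)


def build_transitions(states: Sequence[State],
                      rewards: Sequence[float],
                      idm: Callable[[State, State], float]) -> List[Transition]:
    """Turn a state-reward trajectory into one-step (s, a, r, s_next) tuples.

    The action for each step is inferred from the consecutive state pair
    by the inverse dynamics model passed in as `idm`.
    """
    if len(rewards) != len(states) - 1:
        raise ValueError("need exactly one reward per state-to-state step")
    return [
        (s, idm(s, s_next), r, s_next)
        for s, s_next, r in zip(states, states[1:], rewards)
    ]


def toy_idm(s: State, s_next: State) -> float:
    """Toy stand-in (an assumption): in a 1-D system the action is the displacement."""
    return s_next[0] - s[0]


states = [(0.0,), (1.0,), (3.0,)]
rewards = [0.5, 1.0]
transitions = build_transitions(states, rewards, toy_idm)
print(transitions)
# [((0.0,), 1.0, 0.5, (1.0,)), ((1.0,), 2.0, 1.0, (3.0,))]
```

Tuples in this form are what one-step TD methods consume, which is why inferring the missing actions makes the synthetic data usable by value-based offline RL algorithms.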

2509.15642 2026-05-14 cs.CV

UNIV: Unified Foundation Model for Infrared and Visible Modalities

Fangyuan Mao, Shuo Wang, Jilin Mei, Shun Lu, Chen Min, Fuyang Liu, Xiaokun Feng, Meiqi Wu, Yu Hu

Affiliations: Research Center for Intelligent Computing Systems, CAS ICT; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

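AI summary: This paper proposes UNIV, a unified foundation model for infrared and visible modalities that targets modal bias in cross-modal perception. Its core method, Patch Cross-modal Contrastive Learning (PCCL), uses self-supervised learning to build a unified cross-modal feature space, improving semantic alignment and class separability. The work also builds MVIP, described as the most comprehensive visible-infrared dataset to date, and validates UNIV's performance on multiple tasks.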

Abstract

Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.

2509.13858 2026-05-14 cs.CV

EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics

Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang

Affiliations: University of Electronic Science and Technology of China; Centre for Frontier AI Research, Agency for Science, Technology and Research

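AI summary: This paper proposes EDITS, a framework that improves dataset distillation by exploiting the implicit textual semantics in images. The method fuses external text generated by a vision-language model with image features to build a semantically clustered buffer, selects representative samples through a local semantic awareness module to form image and text prototypes, and finally generates the synthetic dataset with a diffusion model. Experiments indicate that the method significantly improves distillation efficiency while preserving model performance.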

Abstract

Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Ultimately, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at: https://github.com/einsteinxia/EDITS.

2509.10796 2026-05-14 cs.RO

Follow-Bench: A Unified Motion Planning Benchmark for Socially-Aware Robot Person Following

Hanjing Ye, Weixi Situ, Jianwei Peng, Yu Zhan, Bingyi Xia, Kuanqi Cai, Hong Zhang

Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision; Southern University of Science and Technology; Human-Robot Interfaces and Interaction Laboratory; Istituto Italiano Di Tecnologia; Swiss Federal Technology Institute of Lausanne

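AI summary: This paper presents Follow-Bench, the first unified motion planning benchmark for socially-aware robot person following, which addresses how a robot can maintain both safety and comfort while following a target person. The study systematically reviews related scenarios, planning methods, and evaluation metrics, builds a simulation benchmark covering varied trajectory patterns, crowd dynamics, and environmental layouts, and re-implements eight representative planners to evaluate their safety and comfort. Simulation and real-robot experiments reveal the trade-offs and open challenges of existing methods and point to directions for future research.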

Comments Project page: https://follow-bench.github.io/

Abstract

Robot person following (RPF) -- mobile robots that follow and assist a specific person -- has emerging applications in personal assistance, security patrols, eldercare, and logistics. To be effective, such robots must follow the target while ensuring safety and comfort for both the target and surrounding people. In this work, we present the first comprehensive study of RPF, which (i) surveys representative scenarios, motion-planning methods, and evaluation metrics with a focus on safety and comfort; (ii) introduces Follow-Bench, a unified benchmark simulating diverse scenarios, including various target trajectory patterns, crowd dynamics, and environmental layouts; and (iii) re-implements eight representative RPF planners, ensuring that both safety and comfort are systematically considered. Moreover, we evaluate the two best-performing planners from our benchmark on a differential-drive robot to provide insights into real-world deployment of RPF planners. Extensive simulation and real-world experiments provide quantitative study of the safety-comfort trade-offs of existing planners, while revealing open challenges and future research directions.

2509.08461 2026-05-14 cs.LG cs.AI cs.CV hep-ex

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

Affiliations: Department of Computer Science, University of California, Irvine, CA, USA; Department of Physics, University of California, Irvine, CA, USA

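AI summary: This paper studies the use of vision-language models (VLMs) for neutrino event classification in high-energy physics experiments, proposing an approach based on a fine-tuned LLaMA 3.2 VLM and comparing it with a convolutional neural network (CNN) and a Vision Transformer (ViT). Experiments show that transformer-based models outperform the conventional CNN in classification accuracy and robustness, while the VLM, by incorporating textual or semantic information, further improves the interpretability and reasoning of its predictions. The study highlights the potential of VLMs as a general-purpose framework for physics event classification and offers new ideas for multimodal reasoning in neutrino experiments.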

Comments Accepted for publication in Communications Physics (Nature Portfolio)

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMA 3.2 to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in major neutrino experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events, and also a Vision Transformer (ViT-h/14), which is the same architecture inside the VLM's vision encoder. Our evaluation considers both classification performance and interpretability of the model predictions, comparing a VLM with a vision-only transformer (ViT) and a convolutional neural network (CNN) baseline. We find that transformer-based architectures outperform conventional CNNs in classification accuracy and robustness, with the VLM providing additional flexibility through the integration of auxiliary textual or semantic information and enabling more interpretable, reasoning-based predictions. These results highlight the potential of large transformer models, particularly vision-language models, as general-purpose backbones for physics event classification, combining strong performance, robustness, and interpretability, and opening new avenues for multimodal reasoning in experimental neutrino physics.

2509.00626 2026-05-14 cs.CV cs.AI

Towards Methane Detection Onboard Satellites

Maggie Chen, Hala Lamdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini

Affiliations: University of Oxford; Delft University of Technology; Universitat de València; University of Surrey; European Space Agency (ESA)

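AI summary: This paper studies how machine learning onboard satellites can enable rapid methane detection to support timely responses to climate change. It proposes an approach that skips conventional image preprocessing and trains directly on unorthorectified hyperspectral data, reaching detection performance comparable to conventional approaches. It also shows that models trained on orthorectified data outperform the conventional matched filter method, and it releases the datasets and code as a resource for related research.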

Abstract

Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.

2509.00072 2026-05-14 cs.AI

Test of Time: Rethinking Temporal Signal of Benchmark Contamination

Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

Affiliations: Jinesis Lab, University of Toronto & Vector Institute; ETH Zürich & ETH AI Center; Max Planck Institute for Intelligent Systems, Tübingen, Germany; ELLIS Institute Tübingen

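AI summary: This paper re-examines the post-cutoff performance decay of large language models (LLMs) as a temporal signal of benchmark contamination. It shows that this signal depends heavily on how benchmark questions are constructed: even with unchanged source material, different question formats can produce very different temporal patterns. Experiments show that an LLM-driven transformation of the same problems can remove the temporal pattern, and influence function analysis sheds light on the mechanism, indicating that more robust methods are needed to assess contamination.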

Comments ACL 2026

Abstract

Post-cutoff performance decay of LLMs has been widely interpreted as a temporal signal for benchmark contamination, where public information released before the training cutoff may have been included into training corpora and inflated model performance by memorization. We critically examine this view and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed, even if the underlying source material remains invariant. Specifically, we show that LLM-transformed questions can produce remarkably different temporal patterns compared to fill-in-the-blank (cloze) questions directly retrieved from the very same documents. We validate this effect on prior benchmarks that report clear post-cutoff decay (LiveCodeBench), and show that a simple LLM-driven transformation of the same problems can effectively remove the temporal pattern. We further provide a mechanistic understanding of this phenomenon using influence function analysis. Overall, our results suggest that post-cutoff performance decay is a sensitive contamination signal, motivating more robust contamination probes for reliable LLM evaluation.

2508.19651 2026-05-14 cs.CV

Scalable Object Detection in the Car Interior With Vision Foundation Models

Sebastian Schmidt, Bálint Mészáros, Ahmet Firintepe, Stephan Günnemann

Affiliations: Technical University of Munich, School of Computation, Information and Technology; BMW Group

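AI summary: This paper studies efficient detection and localization of objects in the car interior to improve the response quality of personal assistants. To work around the limited compute of on-board systems, the authors propose ODAL, a distributed detection framework built on vision foundation models that splits computation between the vehicle and the cloud. The study also introduces the ODALbench evaluation metric and fine-tunes the lightweight LLaVA 1.5 7B model, which improves by 71% over its baseline and surpasses GPT-4o on key metrics.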

Abstract

AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board systems and the cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL_score of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL_SNR three times higher than GPT-4o.

2508.14302 2026-05-14 cs.LG cs.AI cs.CL

GLASS: Global-Local Aggregation for Inference-time Sparsification of LLMs

Amirmohsen Sattarifard, Sepehr Lavasani, Kunlin Zhang, Amirhossein Rajabpour, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao

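Affiliations: Huawei Technologies Canada Co., Ltd.; Huawei Technologies Ltd.

AI summary: This paper proposes GLASS, an inference-time sparsification framework for deploying large language models efficiently on resource-constrained devices. The method stabilizes dynamic pruning by combining local activation information from the input prompt with a global, model-intrinsic prior, which improves generation quality. Experiments show that GLASS clearly outperforms existing training-free methods in short-prompt, long-generation scenarios, lowering perplexity and KL divergence while speeding up on-device inference.

Abstract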

Inference-time sparsification is a promising path to deploy large language models (LLMs) on resource-constrained devices, yet existing training-free methods typically estimate feedforward network (FFN) neuron importance from the input prompt alone. We show this prompt-only signal is often unreliable, especially for short prompts and long-form decoding, leading to inaccurate masks and degraded generation fidelity. We propose GLASS, a plug-and-play, training-free framework that stabilizes dynamic FFN pruning by aggregating two complementary views of neuron criticality: local prompt-specific activations and a global model-intrinsic prior. GLASS fuses global and local signals via rank aggregation, yielding robust critical-neuron selection even when the prompt is short. We interpret GLASS as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model, providing a principled foundation for its weighted rank-aggregation rule. We apply GLASS to a diverse set of open-source LLMs, and show that it yields substantial improvements over prior training-free baselines in the challenging short-prompt, long-generation scenarios, achieving up to 45.10% lower perplexity and 25.73% lower KL divergence, while delivering significant on-device decoding speedup.
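
The abstract describes fusing a local, prompt-specific importance signal with a global prior through weighted rank aggregation, but it does not give the exact rule. The following is a minimal sketch of one such rule, a weighted sum of per-view ranks (Borda-style), and should not be read as the authors' implementation. The function names `ranks_descending` and `aggregate_mask`, the `weight` parameter, and the tie-breaking by neuron index are all assumptions made for illustration.

```python
from typing import List


def ranks_descending(scores: List[float]) -> List[int]:
    """Return each item's rank, where rank 0 is the highest score.

    Ties are broken by index so the result is deterministic.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def aggregate_mask(
    local_scores: List[float],
    global_scores: List[float],
    keep: int,
    weight: float = 0.5,
) -> List[bool]:
    """Keep the `keep` neurons with the best fused rank.

    `weight` is the (hypothetical) trust placed in the local view;
    1.0 uses the prompt signal alone, 0.0 uses the global prior alone.
    """
    if len(local_scores) != len(global_scores):
        raise ValueError("both views must score the same neurons")
    local_r = ranks_descending(local_scores)
    global_r = ranks_descending(global_scores)
    # Weighted sum of ranks: a lower fused value means more important.
    fused = [weight * l + (1.0 - weight) * g for l, g in zip(local_r, global_r)]
    order = sorted(range(len(fused)), key=lambda i: (fused[i], i))
    kept = set(order[:keep])
    return [i in kept for i in range(len(fused))]


if __name__ == "__main__":
    local = [0.9, 0.1, 0.5, 0.2]    # e.g. activations from a short prompt
    prior = [0.2, 0.8, 0.7, 0.1]    # e.g. a model-wide importance estimate
    print(aggregate_mask(local, prior, keep=2))
```

Working on ranks rather than raw scores makes the two views comparable even when their scales differ, which is one plausible reason to prefer rank aggregation over averaging the scores directly.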

2508.10683 2026-05-14 cs.CL

Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages

Nasma Chaoui, Richard Khoury

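Affiliations: Department of Computer Science and Software Engineering, Université Laval

AI summary: This paper presents the first systematic study of strategies for translating Coptic into French, addressing machine translation for low-resource ancient languages. It evaluates several approaches, including translation through a pivot language, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Experiments show that fine-tuning on a stylistically varied, noise-aware training corpus significantly improves translation quality, offering practical guidance for building translation tools for historical languages.

Journal ref Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026) @ LREC 2026, 482-490

Abstract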

This paper presents the first systematic study of strategies for translating Coptic into French. Our comprehensive pipeline systematically evaluates: pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Utilizing aligned biblical corpora, we demonstrate that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Our findings provide crucial practical insights for developing translation tools for historical languages in general.

2508.07642 2026-05-14 cs.AI cs.CL cs.CV

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi

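Affiliations: Michigan State University; ESAT-PSI, KU Leuven

AI summary: Vision-and-Language Navigation (VLN) requires an agent to understand natural language instructions and navigate complex 3D environments, and current methods still struggle in unseen scenarios that demand complex spatial and temporal reasoning. This paper proposes the SkillNav framework, which decomposes navigation into a set of interpretable atomic skills, each handled by a specialized agent, and thereby introduces structured skill-based reasoning. The study also builds a synthetic data generation pipeline that supports skill training without manual annotation, and designs a router based on a vision-language model that dynamically selects the most suitable agent, substantially improving generalization to novel instruction styles and unseen environments.

Comments Accepted by ACL 2026 Main Conference

Abstract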

Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization on GSA-R2R, a benchmark with novel instruction styles and unseen environments.

2507.19247 2026-05-14 cs.LG cs.AI cs.CL

A Markov Categorical Framework for Language Modeling

Yifan Zhang

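Affiliations: Princeton University

AI summary: This paper proposes an analytical framework for language modeling based on Markov categories, aiming to give a unified account of the internal mechanisms of autoregressive language models, how training shapes representation learning, and how these representations support complex behavior. The framework models the single-step generation process as a composition of information-processing stages, revealing the connections among the training objective, the geometry of the representation space, and model capabilities. The study also shows how the negative log-likelihood objective learns both the next token and the conditional uncertainty of the data, and its spectral analysis shows that, under specific conditions, the optimized loss aligns representation directions with predictive prototypes, offering a new perspective on information flow and internal model structure.

Abstract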

Autoregressive language models achieve remarkable performance, yet a unified theory explaining their internal mechanisms, how training shapes representations, and why these representations support complex behavior remains incomplete. We introduce an analytical framework that models the single-step generation process as a composition of information-processing stages using the language of Markov categories. This compositional perspective connects three aspects of language modeling that are often studied separately: the training objective, the geometry of the learned representation space, and practical model capabilities. First, our framework gives an information-theoretic rationale for parallel drafting methods such as speculative decoding by quantifying the information surplus a hidden state contains about future tokens beyond the immediate next one. Second, we clarify how the standard negative log-likelihood (NLL) objective learns not only a most likely next token, but also the data's intrinsic conditional uncertainty, formalized through categorical entropy. Our main spectral result is conditional: for a linear-softmax head with bounded output features, a calibrated quadratic upper-bound surrogate to NLL induces, after whitening or variance normalization, a generalized CCA/eigenproblem aligning representation directions with predictive prototypes. This gives a compositional lens for understanding how information flows through a model and how likelihood training can shape its internal geometry.

2507.18809 2026-05-14 cs.LG

Test-time Offline Reinforcement Learning on Goal-related Experience

Marco Bagatella, Mert Albaba, Jonas Hübotter, Georg Martius, Andreas Krause

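Affiliations: ETH Zurich, Zurich, Switzerland; Max Planck Institute for Intelligent Systems, Tübingen, Germany; University of Tübingen, Tübingen, Germany

AI summary: This paper studies offline reinforcement learning at test time on past experience related to the goal, with the aim of improving policy performance. The authors propose a new self-supervised data selection criterion that filters offline data by relevance to the current state and quality with respect to the evaluation goal, and show that fine-tuning for a few gradient steps significantly improves the policy. The method is validated on several high-dimensional navigation and manipulation tasks, and with compute allocated sensibly at inference, its gains exceed those from simply scaling model size.

Abstract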

Foundation models compress a large amount of information in a single, large neural network, which can then be queried for individual tasks. There are strong parallels between this widespread framework and offline goal-conditioned reinforcement learning algorithms: a universal value function is trained on a large number of goals, and the policy is evaluated on a single goal in each test episode. Extensive research in foundation models has shown that performance can be substantially improved through test-time training, specializing the model to the current goal. We find similarly that test-time offline reinforcement learning on experience related to the test goal can lead to substantially better policies at modest compute costs. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state and quality with respect to the evaluation goal. We demonstrate across a wide range of high-dimensional loco-navigation and manipulation tasks that fine-tuning a policy on the selected data for a few gradient steps leads to significant performance gains over standard offline pre-training. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out. Finally, we study compute allocation at inference, demonstrating that, at comparable costs, GC-TTT induces performance gains that are not achievable by scaling model size.

2507.15867 2026-05-14 cs.LG cs.AI cs.CL cs.MA

RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records

John Wu, Adam Cross, Jimeng Sun

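Affiliations: Department of Computer Science, University of Illinois Urbana-Champaign; Department of Pediatrics, University of Illinois College of Medicine Peoria

AI summary: This study addresses the underdocumentation of rare diseases in electronic health records and proposes RDMA, an agent-based framework for rare disease mining. The method pairs small quantized language models with specialized tools for abbreviation resolution, implicit phenotype reasoning, and ontology mapping, and substantially outperforms existing methods on multiple datasets without task-specific training. RDMA greatly reduces inference and hardware costs, and its uncertainty-flagging mechanism lightens the expert annotation burden, offering a practical route to scaling rare disease documentation in clinical practice.

Abstract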

Rare diseases affect 1 in 10 Americans yet remain systematically underdocumented in clinical records. ICD-based systems cannot capture their breadth: over 50% of Orphanet codes lack a direct ICD mapping and only 2.2% of HPO codes have matching ICD codes, leaving patient populations invisible and delaying diagnosis. Mining unstructured clinical notes offers a direct path forward, but real notes are long, noisy, and abbreviation-dense, and limited annotations make fine-tuning infeasible, demanding approaches that generalize without task-specific training. We present Rare Disease Mining Agents (RDMA), an agentic framework equipping smaller quantized LLMs with tools for abbreviation resolution, implicit phenotype reasoning, and ontology grounding against Orphanet and HPO. RDMA substantially outperforms fine-tuned and RAG-based baselines across benchmarks with different data characteristics, without any task-specific training. A small quantized model achieves maximal performance, reducing inference costs by up to 10x and local hardware costs by up to 17x, enabling private deployment on standard hardware without cloud-based PHI exposure. RDMA's uncertainty-flagging mechanism further reduces expert annotation burden while preserving agreement quality, supporting scalable rare disease documentation in clinical practice. Available at https://github.com/jhnwu3/RDMA.

2507.10797 2026-05-14 cs.LG math.OC stat.ML

Multi-Armed Sampling Problem and the End of Exploration

Mohammad Pedramfar, Siamak Ravanbakhsh

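Affiliations: Mila - Quebec AI Institute, McGill University

AI summary: This paper proposes the multi-armed sampling framework as the sampling counterpart to the multi-armed bandit optimization problem, with the aim of rigorously analyzing the exploration-exploitation trade-off in sampling. It systematically defines notions of regret for the framework, establishes lower bounds, and proposes a simple algorithm that achieves near-optimal regret bounds; the theoretical results indicate that, unlike optimization, sampling requires almost no exploration. By introducing a temperature parameter, the paper also defines a continuous family of problems connecting multi-armed sampling and multi-armed bandits, providing theoretical foundations for sampling-related research such as neural samplers and entropy-regularized reinforcement learning.

Comments 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

Abstract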

This paper introduces the framework of multi-armed sampling, which serves as the sampling counterpart to the optimization problem of multi-armed bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves near-optimal regret bounds. Our theoretical results suggest that, in contrast to optimization, sampling barely requires any exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolate and unify multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe that the multi-armed sampling framework and our findings in this setting can play a foundational role in the study of sampling, including recent neural samplers, much like the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the role of exploration (or lack thereof) and the convergence properties of algorithms for entropy-regularized reinforcement learning, fine-tuning of pretrained models, and reinforcement learning from human feedback (RLHF).
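
The abstract does not define its temperature-indexed family of problems. As a rough intuition only, the sketch below shows one standard way a temperature can interpolate between sampling and optimization: a Boltzmann (softmax) target over arm rewards. The function name `boltzmann_target` and the choice of a softmax target are assumptions for illustration, not the paper's definition.

```python
import math
from typing import List


def boltzmann_target(rewards: List[float], temperature: float) -> List[float]:
    """Probability of each arm under a softmax of rewards / temperature.

    As the temperature shrinks, mass concentrates on the best arm
    (an optimization-like goal); at larger temperatures the target
    stays spread out (a sampling-like goal).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rewards = [0.0, 1.0, 0.5]  # hypothetical mean rewards of three arms
    for t in (1.0, 0.1, 0.01):
        print(t, [round(p, 4) for p in boltzmann_target(rewards, t)])
```

Under this assumed target, a learner at low temperature must identify the single best arm, as in a bandit problem, while at higher temperatures it only needs to reproduce the spread of the distribution.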

2507.03167 2026-05-14 cs.CL cs.AI cs.LG

Where Do Reasoning Models Refuse?

Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi

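Affiliations: The Alan Turing Institute; University of Oxford; Northeastern University

AI summary: This paper studies at what point during generation reasoning models decide to refuse harmful requests. Analyzing four open-source reasoning models, it finds that the chain of thought causally influences the refusal decision: fixing a specific reasoning trace substantially reduces the variance in whether the model ultimately refuses or complies. It also finds that in distilled models, subtle differences at the start of the reasoning chain can fully determine the refusal decision, and that this pattern transfers across models distilled from the same teacher. Finally, the study extracts refusal directions from model activations and verifies their effect on harmful compliance.

Comments v1 accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM). 20 pages, 12 figures

Abstract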

Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.

2507.01908 2026-05-14 cs.CV

Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

Qingdong He, Xueqin Chen, Chaoyi Wang, Yanjie Pan, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Xiangtai Li, Jiangning Zhang

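Affiliations: Tencent Youtu Lab; Sichuan University; University of the Chinese Academy of Sciences; Fudan University; Zhejiang University; National University of Singapore; Nanyang Technological University

AI summary: This paper proposes a method for hypothetical-instruction image editing based on visual reasoning, targeting the weakness of existing image editing techniques on complex implicit instructions. It introduces the Reason50K dataset and the ReasonBrain framework. Reason50K contains over 50K samples covering four scenario types: physical, temporal, causal, and story reasoning. ReasonBrain combines multimodal large language models with a diffusion model, using a fine-grained reasoning cue extraction module and a cross-modal enhancer to understand and execute implicit instructions accurately. Experiments show that the method performs well on reasoning scenarios and generalizes well zero-shot.

Comments Accepted by ICML 2026

Abstract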

Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

2507.00990 2026-05-14 cs.RO cs.AI cs.CV

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li

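Affiliations: UIUC; UC Irvine; Columbia University

AI summary: This paper proposes RIGVid, a system that lets robots perform complex manipulation tasks such as pouring, wiping, and mixing by imitating AI-generated videos, without any physical demonstrations or robot-specific training. The system generates candidate demonstration videos from a language command and an initial scene image, uses a vision-language model to keep only the videos that follow the command, and then extracts object trajectories via 6D pose tracking and retargets them to the robot. Experiments show that the generated videos perform well on real tasks, that performance improves with generation quality, and that the approach outperforms more compact alternatives such as keypoint prediction.

Comments In ICLR 2026. Website: https://rigvid-robot.github.io/

Abstract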

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.

2507.00029 2026-05-14 cs.LG cs.AI

LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang

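Affiliations: Huazhong University of Science and Technology

AI summary: This paper proposes LoRA-Mixer, a modular mixture-of-experts framework designed to improve the parameter efficiency and task specialization of large language models in multi-task adaptation. Unlike existing methods, LoRA-Mixer embeds task-specific LoRA experts in the core projection matrices of the attention module rather than primarily targeting FFN blocks, enabling finer-grained token-level specialization. With an adaptive Routing Specialization Loss (RSL), the method trains robust routing policies from limited data, improves the stability of expert selection and the rate of expert reuse, and outperforms existing methods on multiple benchmarks with fewer trainable parameters.

Abstract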

Recent attempts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for multi-task adaptation of Large Language Models (LLMs) often replace whole attention/FFN layers with switch experts or append parallel expert branches, undermining parameter efficiency and limiting task specialization. We introduce LoRA-Mixer, a modular MoE framework that routes task-specific LoRA experts into the core projection matrices of the attention module, namely input and output linear layers, rather than primarily targeting FFN blocks. The design delivers fine-grained token-level specialization by fully exploiting the attention mechanism, while remaining drop-in compatible with Transformers and state-space models (SSMs), since linear projection layers are ubiquitous. To train robust routers from limited data while promoting stable, selective decisions and high expert reuse, LoRA-Mixer employs an adaptive Routing Specialization Loss (RSL) that jointly enforces global load balance and input-aware specialization via an entropy-shaping objective. The framework supports two regimes: (i) joint optimization of adapters and router with a differentiable hard-soft top-k routing scheme, and (ii) plug-and-play routing over frozen, pre-trained LoRA modules sourced from public repositories. Across 15 benchmarks, including MedQA, GSM8K, HumanEval, and GLUE, RSL-optimized LoRA-Mixer outperforms state-of-the-art routing and LoRA-MoE baselines while using 48 percent of their trainable parameters, with gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C, respectively. Cross-model transfer and adapter reuse experiments further demonstrate the approach's versatility and data efficiency. Our code is available at https://github.com/hustcselwb/LoRA-Mixer.

2506.15953 2026-05-14 cs.RO

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

Liang Heng, Haoran Geng, Kaifeng Zhang, Pieter Abbeel, Jitendra Malik

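Affiliations: University of California, Berkeley; Peking University; Sharpa

AI summary: ViTacFormer is a cross-modal representation learning method for visuo-tactile dexterous manipulation, designed to improve a robot's ability to perform fine-grained manipulation in complex environments. It combines a cross-attention encoder with an autoregressive tactile prediction head to fuse high-resolution vision and touch, and refines the cross-modal representation through a progressive curriculum. Experiments show that ViTacFormer achieves higher success rates on several real-world benchmarks and is the first to complete multi-stage, long-horizon, high-precision dexterous manipulation tasks with an anthropomorphic hand.

Abstract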

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.

2506.13456 2026-05-14 cs.AI cs.RO

Block-wise Adaptive Caching for Accelerating Diffusion Policy

Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Jianbo Zhou, Shengjia Hua, Lei Chen, Zhi Wang

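Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University

AI summary: Diffusion Policy excels at visuomotor control modeling, but its high computational cost makes it hard to use for real-time robot control. This paper proposes Block-wise Adaptive Caching (BAC), which caches intermediate action features and adaptively updates and reuses them to accelerate action generation without loss. BAC introduces an Adaptive Caching Scheduler and the Bubbling Union Algorithm, which effectively mitigate inter-block propagation of caching errors, and delivers up to 3× inference speedup for a range of transformer-based diffusion policies and vision-language-action models without changing the model architecture.

Abstract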

Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching (BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities exhibit non-uniform temporal dynamics and distinct block-specific patterns. To operationalize this insight, we first design an Adaptive Caching Scheduler to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3× inference speedup for free. Project page: https://block-wise-adaptive-caching.github.io.