arXivDaily: arXiv daily academic digest, updated Monday through Friday
2409.01159 2026-05-29 cs.RO

Remote telepresence over large distances via robot avatars: case studies

Mohamed Elobaid, Stefano Dafarra, Ehsan Ranjbari, Giulio Romualdi, Tomohiro Chaki, Tomohiro Kawakami, Takahide Yoshiike, Daniele Pucci

Affiliations: Artificial and Mechanical Intelligence AMI (Italian Institute of Technology); Frontier Robotics, Innovative Research Excellence, Honda R&D; Machine Learning and Optimisation, The University of Manchester

AI summary: This paper examines how to adapt a recently proposed avatar system architecture to robots of different morphologies (wheeled and legged, with various hand types and kinematic structures) in order to achieve intercontinental telepresence under bandwidth constraints.

Abstract

This paper discusses the necessary considerations and adjustments that allow a recently proposed avatar system architecture to be used with different robotic avatar morphologies (both wheeled and legged robots with various types of hands and kinematic structures) for the purpose of enabling remote (intercontinental) telepresence under communication bandwidth restrictions. The case studies reported involve robots using both position and torque control modes, independently of their software middleware.

2409.01144 2026-05-29 cs.RO

Adaptive Non-linear Centroidal MPC with Stability Guarantees for Robust Locomotion of Legged Robots

Mohamed Elobaid, Giulio Turrisi, Lorenzo Rapetti, Giulio Romualdi, Stefano Dafarra, Tomohiro Kawakami, Tomohiro Chaki, Takahide Yoshiike, Claudio Semini, Daniele Pucci

Affiliations: Artificial and Mechanical Intelligence (AMI), Istituto Italiano di Tecnologia (IIT); Dynamic Legged Systems (DLS), Istituto Italiano di Tecnologia (IIT); Frontier Robotics, Innovative Research Excellence, Honda R&D, Saitama, Japan; Machine Learning and Optimisation, The University of Manchester

AI summary: Reformulates the centroidal MPC controller using adaptive control and Lyapunov functions, providing closed-loop stability and robustness guarantees for legged robots under unknown payloads and constant disturbances.

Abstract

Nonlinear model predictive locomotion controllers based on the reduced centroidal dynamics are nowadays ubiquitous in legged robots. These schemes, even though they assume an inherent simplification of the robot's dynamics, were shown to endow robots with a step-adjustment capability in reaction to small pushes. Moreover, in the case of uncertain parameters, such as unknown payloads, they were shown to provide some practical, albeit limited, robustness. In this work, we provide rigorous certificates of their closed-loop stability via a reformulation of the centroidal MPC controller. This is achieved thanks to a systematic procedure inspired by the machinery of adaptive control, together with ideas coming from control Lyapunov functions. Our reformulation, in addition, provides robustness for a class of unmeasured constant disturbances. To demonstrate the generality of our approach, we validated our formulation on a new-generation humanoid robot, the 56.7 kg ergoCub, as well as on a commercially available 21 kg quadruped robot, Aliengo.
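
For readers unfamiliar with the "reduced centroidal dynamics" mentioned in the abstract, the standard textbook form is sketched below. The symbols are assumptions introduced here for illustration (the abstract defines none of them): $m$ is the total mass, $c$ the center-of-mass position, $L$ the angular momentum about the center of mass, $g$ the gravity vector, and $f_i$ the contact force applied at contact point $p_i$. The sketch assumes point contacts without contact torques; the paper's exact formulation may differ.

```latex
\begin{aligned}
m\,\ddot{c} &= m\,g + \sum_{i} f_i, \\
\dot{L} &= \sum_{i} (p_i - c) \times f_i .
\end{aligned}
```

An unknown payload changes $m$ and adds an unmodeled force, which is one way to see why the abstract treats payloads as parameter uncertainty and constant disturbances.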

2406.10238 2026-05-29 cs.CL cs.LG cs.SI

Early Detection of Misinformation for Infodemic Management: A Domain Adaptation Approach

Minjia Mao, Xiaohang Zhao, Xiao Fang

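Affiliations: Lerner College of Business and Economics, University of Delaware; School of Information Management & Engineering, Shanghai University of Finance and Economics

AI summary: To address the lack of labeled data in the early stage of an infodemic, proposes a domain adaptation method for misinformation detection that handles both covariate shift and concept shift and outperforms existing methods on real-world datasets.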

Abstract

An infodemic refers to an enormous amount of true information and misinformation disseminated during a disease outbreak. Detecting misinformation at the early stage of an infodemic is key to reducing its harm to public health. An early-stage infodemic is characterized by a large volume of unlabeled information concerning a disease. As a result, conventional misinformation detection methods are not suitable for this misinformation detection task because they rely on labeled information in the infodemic domain to train their models. To address this limitation, state-of-the-art methods learn their models using labeled information in other domains to detect misinformation in the infodemic domain. The efficacy of these methods depends on their ability to mitigate both covariate shift (i.e., differences in feature distributions) and concept shift (i.e., differences in labeling patterns) between the infodemic domain and the domains from which they leverage labeled information. However, these methods focus on mitigating covariate shift but overlook concept shift, rendering them less effective for the task. In response, we theoretically show the necessity of tackling both covariate and concept shifts as well as how to operationalize each of them. Built on the theoretical analysis, we develop a novel misinformation detection method that addresses both covariate and concept shifts. Using real-world datasets, we conduct extensive empirical evaluations to demonstrate the superior performance of our method over state-of-the-art misinformation detection methods as well as prevalent domain adaptation methods that can be tailored to solve the misinformation detection task.
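
The two kinds of shift named in the abstract can be written down compactly. The notation below is an assumption made for illustration: $P_S$ and $P_T$ denote the source-domain and infodemic-domain (target) distributions, $x$ the features of a piece of information, and $y$ its true/misinformation label. This is the common textbook definition and not necessarily the paper's formalization.

```latex
\begin{aligned}
\text{covariate shift:}\quad & P_S(x) \neq P_T(x), \\
\text{concept shift:}\quad & P_S(y \mid x) \neq P_T(y \mid x), \\
\text{since}\quad & P(x, y) = P(x)\,P(y \mid x).
\end{aligned}
```

Because the joint distribution factors into these two terms, aligning only the feature distributions can still leave the labeling pattern mismatched, which is the gap the abstract points to.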

2405.13003 2026-05-29 cs.CL cs.AI cs.IR

A Survey on Recent Advances in Conversational Data Generation

Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi

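Affiliations: Radboud University; University of Amsterdam

AI summary: This paper systematically surveys multi-turn conversational data generation methods for three types of dialogue systems (open domain, task-oriented, and information-seeking), proposes a general framework comprising seed data creation, utterance generation, and quality filtering, and discusses evaluation metrics and future directions.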

Abstract

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.

2403.09441 2026-05-29 cs.LG

An Empirical Study of the Influence of Adversarial Fine-Tuning on Compressed Neural Networks

Hallgrimur Thorsteinsson, Valdemar J Henriksen, Daniel I R Cruz, Raghavendra Selvan, Tong Chen

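Affiliations: Department of Computer Science, University of Copenhagen, Denmark

AI summary: An experimental study of adversarial fine-tuning of compressed models, finding that it can substantially improve robustness while balancing computational efficiency against robustness.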

Comments 23 pages, 4 figures, 9 tables. Accepted to The 15th Scandinavian Conference on Artificial Intelligence (SCAI)

Abstract

As deep learning (DL) models are increasingly being integrated into our everyday lives, ensuring their safety by making them robust against adversarial attacks has become increasingly critical. DL models have been found to be susceptible to adversarial attacks that introduce small, targeted perturbations into the input data. Adversarial training has been presented as a mitigation strategy that can result in more robust models. This adversarial robustness comes with additional computational costs required to design adversarial attacks during training. The two objectives, adversarial robustness and computational efficiency, then appear to be in conflict with each other. In this work, we explore the effects of neural network compression on adversarial robustness. We specifically explore the effects of fine-tuning on compressed models, and present the trade-off between standard fine-tuning and adversarial fine-tuning. Our results show that adversarial fine-tuning of compressed models can yield large improvements to their robustness performance. We present experiments on several benchmark datasets showing that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models, while also improving computational efficiency. Source code is available here: https://github.com/saintslab/Adver-Fine.
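
To make "small, targeted perturbations" concrete, here is a minimal sketch of the fast gradient sign method (FGSM), a well-known attack, applied to a toy logistic-regression model. The choice of FGSM, the model, and the function names are assumptions made for illustration; the abstract does not say which attacks or architectures the paper uses. Adversarial fine-tuning would then continue training on inputs perturbed in this way.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(x, y, weights, bias, epsilon):
    """Return x + epsilon * sign(dLoss/dx) for a logistic-regression model.

    x, weights: lists of floats; y: true label (0 or 1);
    epsilon: maximum per-feature perturbation (hypothetical budget).
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # For binary cross-entropy, the gradient w.r.t. the input is (p - y) * w.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


if __name__ == "__main__":
    x_clean = [1.0, -1.0]
    x_adv = fgsm_perturb(x_clean, 1, [2.0, -1.0], 0.0, 0.1)
    print(x_adv)  # each feature moves by at most epsilon
```

Each feature moves by at most `epsilon`, in the direction that increases the loss for the true label.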

2401.08197 2026-05-29 cs.LG cs.IT eess.SP math.IT

Matrix Completion with Hypergraphs: Sharp Thresholds and Efficient Algorithms

Zhongtian Ma, Qiaosheng Zhang, Zhen Wang

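Affiliations: Northwestern Polytechnical University

AI summary: This paper studies the problem of completing a rating matrix from sub-sampled matrix entries together with observed social graphs and hypergraphs. It proves that a sharp threshold on the sample probability exists and develops an efficient algorithm that succeeds with high probability once the sample probability exceeds the threshold; hypergraphs effectively reduce the required sample probability.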

Comments Accepted to LOG24

Abstract

This paper considers the problem of completing a rating matrix based on sub-sampled matrix entries as well as observed social graphs and hypergraphs. We show that there exists a sharp threshold on the sample probability for the task of exactly completing the rating matrix: the task is achievable when the sample probability is above the threshold, and is impossible otherwise, demonstrating a phase transition phenomenon. The threshold can be expressed as a function of the "quality" of hypergraphs, enabling us to quantify the amount of reduction in sample probability due to the exploitation of hypergraphs. This also highlights the usefulness of hypergraphs in the matrix completion problem. En route to discovering the sharp threshold, we develop a computationally efficient matrix completion algorithm that effectively exploits the observed graphs and hypergraphs. Theoretical analyses show that our algorithm succeeds with high probability as long as the sample probability exceeds the aforementioned threshold, and this theoretical result is further validated by synthetic experiments. Moreover, our experiments on a real social network dataset (with both graphs and hypergraphs) show that our algorithm outperforms other state-of-the-art matrix completion algorithms.

2305.10917 2026-05-29 cs.RO

Online Non-linear Centroidal MPC for Humanoid Robots Payload Carrying with Contact-Stable Force Parametrization

Mohamed Elobaid, Giulio Romualdi, Gabriele Nava, Lorenzo Rapetti, Hosameldin Awadalla Omer Mohamed, Daniele Pucci

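Affiliations: Mechanical engineering department; Machine Learning and Optimisation, The University of Manchester

AI summary: For humanoid robots walking while carrying a payload, proposes combining an online nonlinear centroidal model predictive controller with a contact-stable force parametrization to track given planned footsteps.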

Abstract

In this paper we consider the problem of allowing a humanoid robot that is subject to a persistent disturbance, in the form of a payload-carrying task, to follow given planned footsteps. To solve this problem, we combine an online nonlinear centroidal Model Predictive Controller (MPC) with a contact-stable force parametrization. The cost function of the MPC is augmented with terms handling the disturbance and regularizing the parameter. The performance of the resulting controller is validated both in simulations and on the humanoid robot iCub. Finally, the effect of using the parametrization on the computational time of the controller is briefly studied.

2205.04297 2026-05-29 cs.RO cs.AI

Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes

Liang Xie, Hongxiang Yu, Kechun Xu, Tong Yang, Minhang Wang, Haojian Lu, Rong Xiong, Yue Wang

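Affiliations: College of Control Science and Engineering, Zhejiang University, Zhejiang, China; The Application Innovate Lab, Huawei Incorporated Company, China

AI summary: Proposes a learning-based visual peg-in-hole method that decouples the perception and policy modules, trains on several shapes in simulation, and adapts to arbitrary unseen shapes in the real world at minimal sim-to-real transfer cost.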

Abstract

This paper proposes a learning-based visual peg-in-hole method that enables training with several shapes in simulation and adapting to arbitrary unseen shapes in the real world with minimal sim-to-real cost. The core idea is to decouple the generalization of the sensory-motor policy into the design of a fast-adaptable perception module and a simulated generic policy module. The framework consists of a segmentation network (SN), a virtual sensor network (VSN), and a controller network (CN). Concretely, the VSN is trained to measure the pose of the unseen shape from a segmented image. After that, given the shape-agnostic pose measurement, the CN is trained to achieve generic peg-in-hole. Finally, when applied to real unseen holes, we only have to fine-tune the SN required by the simulated VSN+CN. To further minimize the transfer cost, we propose to automatically collect and annotate the data for the SN after one minute of human teaching. Simulated and real-world results are presented under eye-to-hand and eye-in-hand configurations. An electric vehicle charging system with the proposed policy inside achieves a 10/10 success rate in 2-3 s, using only hundreds of auto-labeled samples for the SN transfer.

2605.29582 2026-05-29 cs.LG cs.CL

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du

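Affiliations: University of Science and Technology of China; iFLYTEK Research

AI summary: Proposes the PEARL framework, which trains Socratic tutoring agents using a controllable student simulator, a generative reward model, and stable multi-objective reinforcement learning, achieving the best performance among open-source models on multiple benchmarks while remaining competitive with proprietary models.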

Comments 16 pages, 7 figures

Abstract

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
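
The multi-objective scheme in the abstract ("discretizes rewards within each dimension and aggregates normalized advantages across dimensions") can be sketched roughly as follows. Everything specific here is an assumption for illustration: rewards are taken to lie in [0, 1], discretization uses evenly spaced bins, normalization is a per-dimension z-score over a group of rollouts, and aggregation is a plain sum. The paper may use different choices for each step.

```python
from statistics import mean, pstdev


def discretize(reward, num_bins=5):
    """Snap a reward in [0, 1] to one of num_bins evenly spaced levels."""
    clipped = min(max(reward, 0.0), 1.0)
    return round(clipped * (num_bins - 1)) / (num_bins - 1)


def aggregate_advantages(rewards, num_bins=5, eps=1e-8):
    """Combine per-dimension normalized advantages for a group of rollouts.

    rewards[i][d] is the reward of rollout i on pedagogical dimension d
    (hypothetical layout). Returns one scalar advantage per rollout.
    """
    num_dims = len(rewards[0])
    totals = [0.0] * len(rewards)
    for d in range(num_dims):
        column = [discretize(r[d], num_bins) for r in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            # Normalizing inside each dimension puts all dimensions on a
            # common scale before they are summed.
            totals[i] += (value - mu) / (sigma + eps)
    return totals


if __name__ == "__main__":
    group = [[1.0, 0.5], [0.0, 0.5]]
    print(aggregate_advantages(group))
```

Because each dimension is standardized before summation, a dimension with widely spread raw rewards contributes on the same scale as the others, which matches the abstract's stated goal of keeping high-variance objectives from dominating updates.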

2605.29580 2026-05-29 cs.LG stat.ML

On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference

Daniel Dold, Emanuel Sommer, Julius Kobialka, Oliver Dürr, David Rügamer

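Affiliations: HTWG Konstanz; LMU Munich; Munich Center for Machine Learning (MCML)

AI summary: This paper proposes LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space that connects independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and Jensen-Shannon divergence regularization, it raises the mutual information of the predictive distribution without sacrificing performance, yielding functional diversity.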

Abstract

While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging. Recent results in the LoRA regime suggest that discrete multi-mode approaches such as deep ensembles offer little benefit over single-mode methods. This contradicts broader observations in deep learning, where ensembling independent optima typically improves generalization, and linking these modes through continuous low-loss valleys further enhances Bayesian model averaging (BMA). Whether such structure exists in the LoRA space and whether it yields functional diversity missed by local or discrete methods has not been studied. We introduce LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space, with two variants: a free configuration that jointly optimizes all control points, and an anchored configuration that connects independently fine-tuned LoRA optima. We prove pathwise continuity and Lipschitz regularity of the loss along the curve and empirically show, across reasoning and classification benchmarks with Qwen2.5 7B, that linear interpolation encounters loss barriers, while our anchored multi-segment curves connect independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and a Jensen-Shannon divergence regularizer, LoRA-Curve yields measurably higher mutual information of the predictive distribution without sacrificing performance, and links continuous parameter-space traversal to functional diversity.
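
As a rough illustration of a "segmented Bézier curve parameterization", the sketch below evaluates a chain of Bézier segments over flat parameter vectors. The function names, the use of plain Python lists for parameters, and the global-parameter convention are assumptions for this example; how LoRA-Curve actually stores and trains its control points is not described in the abstract.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate one Bezier segment at t in [0, 1] via the Bernstein basis.

    control_points: list of equal-length parameter vectors; the first and
    last entries are the segment endpoints (the "anchors").
    """
    n = len(control_points) - 1
    point = [0.0] * len(control_points[0])
    for k, cp in enumerate(control_points):
        weight = comb(n, k) * (t ** k) * ((1 - t) ** (n - k))
        for d, value in enumerate(cp):
            point[d] += weight * value
    return point


def segmented_bezier(segments, s):
    """Evaluate a chain of segments at global parameter s in [0, len(segments)]."""
    idx = min(int(s), len(segments) - 1)
    return bezier_point(segments[idx], s - idx)


if __name__ == "__main__":
    # Two hypothetical anchors joined by one free control point.
    theta_a, theta_b = [0.0, 0.0], [2.0, 0.0]
    print(bezier_point([theta_a, [1.0, 2.0], theta_b], 0.5))
```

In the anchored configuration described in the abstract, the endpoints would be independently fine-tuned optima and only the interior control points would be trained; consecutive segments sharing an endpoint keep the path continuous.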

2605.29579 2026-05-29 cs.CV

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

Shizhe Zhou, Bohan Jia, Kai Wu, Yan Shen, Tongyun Li, Yuyang Wu, Shaohui Lin

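Affiliations: East China Normal University

AI summary: Proposes the ReactBench benchmark, which uses adversarial images and hallucination-inducing queries to systematically evaluate cause-driven hallucinations of multimodal large language models on four tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting.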

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在视觉-语言理解方面取得了快速进展,但它们仍然容易产生多模态幻觉,即生成与视觉输入不一致的响应。现有基准主要侧重于检测幻觉结果,而非评估这些失败的潜在原因。此外,许多基准依赖于简单的场景和有限的评估格式,不再能挑战最先进的模型。为了解决这些局限性,我们引入了ReactBench,一个因果驱动的幻觉基准,具有多个任务和考试式评估格式。通过生成对抗性图像和诱导幻觉的查询,ReactBench引入了四个目标任务:关系擦除、反事实属性、变化追踪和密集计数。这些任务系统地暴露了共现偏差、语言先验、跨图像比较感知缺陷和细粒度感知瓶颈。除了基于标准准确率的评估外,我们利用思维链推理来识别每个任务中幻觉的细粒度子原因。大量评估表明,当前的MLLMs仍然容易受到特定因果幻觉触发因素的影响,这证明了ReactBench作为诊断和提高多模态模型鲁棒性的系统化和可解释测试平台的价值。项目页面见https://reactbench.github.io/。

英文摘要

While multimodal large language models (MLLMs) have achieved rapid progress in vision-language understanding, they remain prone to multimodal hallucinations, producing responses that are inconsistent with the visual input. Existing benchmarks predominantly focus on detecting hallucination outcomes rather than evaluating the underlying causes of these failures. Moreover, many benchmarks rely on simplistic scenarios and limited evaluation formats that no longer challenge state-of-the-art models. To address these limitations, we introduce ReactBench, a cause-driven hallucination benchmark featuring multiple tasks and an exam-style evaluation format. By generating adversarial images and hallucination-inducing queries, ReactBench introduces four targeted tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These tasks systematically expose co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond standard accuracy-based evaluation, we leverage Chain-of-Thought reasoning to identify fine-grained sub-causes of hallucination within each task. Extensive evaluations reveal that current MLLMs remain notably vulnerable to cause-specific hallucination triggers, demonstrating the value of ReactBench as a systematic and interpretable testbed for diagnosing and improving multimodal model robustness. The project page is available at https://reactbench.github.io/.

2605.29578 2026-05-29 cs.AI

GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

Yifan Liu, Yanling Sang, Xishun Liao, Morgan Sun, Bo Yang, Zhiyuan Zhang, Chris Stanford, Haoxuan Ma, Jiaqi Ma

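Affiliations: UCLA Mobility Lab, Department of Civil and Environmental Engineering, University of California, Los Angeles; Novateur Research Solutions; University of Central Florida

AI summary: Proposes a four-stage simulation framework that combines month-conditioned spatial priors derived from GPS and survey data, tourist demographics, distance-feasible ward sequence assignment, and LLM-based activity chain generation to model tourist mobility, which is non-routine, attraction-driven, and highly sensitive to trip purpose, season, and trip member composition.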

Abstract

Tourist mobility poses a distinct challenge for urban transportation planning. Unlike resident commuting, tourist travel is largely non-routine, attraction-driven, and highly sensitive to trip purpose, travel season, and trip member composition. Existing approaches either measure aggregate tourist spatial patterns without generating individual schedules, or synthesize mobility without tourist-specific structure such as trip duration conditioning, month-varying attraction demand, and household co-travel rules. To address these challenges, we propose a four-stage simulation framework combining month-conditioned spatial priors derived from GPS and survey data, trip extent prediction from tourist demographics, distance-feasible ward sequence assignment, and LLM-based activity chain generation under household and spatial constraints. GPS data are used only in privacy-preserving aggregated form as month-conditioned spatial priors, with no individual traces retained or exposed. Experiments on tourism in Tokyo demonstrate that the GPS-based tourist cohort extraction recovers spatial visitation signatures consistent with survey references, and our framework produces demographically aligned synthetic schedules whose ward-level visitation shares align closely with both survey distributions and staypoint-derived monthly visitation patterns. The results demonstrate the framework's effectiveness as a geographically grounded, demographically aware approach to tourist mobility modeling.

2605.29577 2026-05-29 cs.CV

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim

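Affiliations: KAIST; Korea University

AI summary: Proposes inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder; by predicting the action between current and future observations, the encoder captures fine-grained visual distinctions, which mitigates state aliasing.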

Abstract

Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
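
The data side of an inverse dynamics objective can be sketched as below: each training example pairs two observations with the action that connects them. The treatment of "pseudo-reversed supervision" here is only one plausible reading and is an assumption, not something the abstract confirms: it swaps the two observations and negates the action, which only makes sense if actions are relative displacements. The function name and data layout are likewise hypothetical.

```python
def inverse_dynamics_pairs(observations, actions, add_reversed=True):
    """Build ((obs_t, obs_next), action) examples from one trajectory.

    observations has one more element than actions; actions[t] is assumed
    to move the robot from observations[t] to observations[t + 1].
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("expected len(observations) == len(actions) + 1")
    examples = []
    for t, action in enumerate(actions):
        forward = (observations[t], observations[t + 1])
        examples.append((forward, list(action)))
        if add_reversed:
            # Assumption: actions are relative displacements, so negating
            # them approximates the move from obs_{t+1} back to obs_t.
            backward = (observations[t + 1], observations[t])
            examples.append((backward, [-a for a in action]))
    return examples


if __name__ == "__main__":
    obs = ["frame0", "frame1", "frame2"]
    acts = [[1.0, 0.0], [0.0, -2.0]]
    for pair, target in inverse_dynamics_pairs(obs, acts):
        print(pair, target)
```

A model trained on such examples would take the encoder features of both observations and regress the target action; at test time this auxiliary head is unused, consistent with the abstract's note that the original inference pipeline is preserved.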

2605.29572 2026-05-29 cs.RO cs.HC

Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models

Li Zou, Yasemin Vardar

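Affiliations: Delft University of Technology (TU Delft), Department of Cognitive Robotics

AI summary: Presents an interpretable computational framework that uses multisensory touch data (pressing, static contact, and sliding interactions) to model human material perception and recognition, finding that thermal and compliance cues are important for perceptual modeling and material classification.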

Comments 12 pages, 3 figures, journal

Abstract

Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.

2605.29570 2026-05-29 cs.CV

DefSynUS: Real-time Patient-specific Intrahepatic Vessel Identification via Deformation-Aware CT-US Domain Adaptation

Karl-Philippe Beaudet, Yordanka Velikova, Sidaty El Hadramy, Nassir Navab, Philippe Cattin, Juan Verde, Stéphane Cotin

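Affiliations: Inria; University of Strasbourg; Technical University of Munich; University of Basel; Institute of Image-Guided Surgery

AI summary: Proposes a domain adaptation framework based on physics-based rendering and deformation-aware data augmentation that enables real-time, patient-specific intraoperative identification of intrahepatic vessel branches without preoperative ultrasound.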

Abstract

Purpose: Laparoscopic ultrasound (LUS) enhances the safety of liver surgery by visualizing intrahepatic vessels in real-time. Still, vessel identification remains difficult due to probe constraints, complex vascular structure, and tissue deformation. This work aims to enable real-time, patient-specific vessel identification that remains robust under deformation through deformable ultrasound augmentation. Methods: Preoperative CT vessel annotations are used to generate synthetic ultrasound data via optimized physics-based rendering, coupled with domain adaptation to intraoperative ultrasound. The rendering is trained end-to-end for vessel identification and patient-specificity, eliminating the need for preoperative ultrasound. A deformation-aware augmentation simulates realistic intraoperative motion and tissue deformation within the rendering pipeline. Results: In abdominal phantom and limited clinical feasibility experiments (single-case clinical evaluation), the framework achieved real-time intrahepatic vessel-branch identification, maintaining performance under new patient poses. Conclusion: The framework enables real-time vessel identification without preoperative ultrasound and supports technical feasibility, but multi-patient validation is still needed for generalizability and clinical feasibility.

2605.29568 2026-05-29 cs.AI

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool: 通过过程监督强化学习扩展工具集成推理中的交错思考

Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

发表机构 * Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China(哈尔滨工业大学社会计算与交互机器人研究中心,中国)

AI总结 针对工具集成推理中缺乏逐步监督和自纠正能力的问题,提出DeepTool框架,通过合成交错轨迹和基于动作中心过程奖励的GRPO强化学习,显著提升模型在多个基准上的性能。

详情
AI中文摘要

工具集成推理通过利用外部环境扩展了LLM的能力。然而,现有方法在顺序调用工具时缺乏战略规划和自我纠正所需的思考。虽然强化学习缓解了这一问题,但传统的工具集成推理方法受到稀疏的基于结果奖励的阻碍,无法监督中间推理步骤和工具调用。为了解决这个问题,我们提出了DeepTool,一个新颖的框架,它在每一轮思考、行动和观察的交错过程中扩展了深思熟虑的思考。在DeepTool中,我们首先引入了一个合成流程,将扩展思考演变为交错轨迹,并集成对抗性扰动以确保鲁棒性和自我纠正。其次,我们基于GRPO设计了过程监督强化学习,利用以行动为中心的过程奖励来强化中间交错思考,并在每一轮强制执行精确的工具调用。大量实验表明,DeepTool实现了卓越的性能,在六个基准测试中显著提升了Qwen2.5-7B(例如,AIME24: 3.2% -> 40.4%,HMMT25: 0.0% -> 28.6%)。此外,令牌成本效益分析证实了交错思考的实用性,展示了DeepTool在性能和令牌效率之间的最佳平衡。

英文摘要

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.
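
The GRPO step mentioned in the abstract can be illustrated with a minimal Python sketch. GRPO scores each sampled trajectory relative to the other trajectories in its group by normalizing rewards with the group mean and standard deviation. The abstract does not say how the Action-Centric Process Reward is computed or combined with the outcome reward, so `trajectory_reward` and its `process_weight` parameter are hypothetical placeholders, not the authors' formulation.

```python
from statistics import mean, pstdev


def trajectory_reward(turn_rewards, outcome_reward, process_weight=0.5):
    """Hypothetical mix of per-turn process rewards and the final outcome reward.

    Assumption: the paper's actual reward shaping is not given in the abstract.
    """
    if not turn_rewards:
        return outcome_reward
    return outcome_reward + process_weight * mean(turn_rewards)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for the same prompt: two succeed, two fail.
rewards = [
    trajectory_reward([1.0, 1.0], outcome_reward=1.0),
    trajectory_reward([0.0, 0.0], outcome_reward=0.0),
    trajectory_reward([1.0, 1.0], outcome_reward=1.0),
    trajectory_reward([0.0, 0.0], outcome_reward=0.0),
]
advantages = group_relative_advantages(rewards)
```

Trajectories above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.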

2605.29564 2026-05-29 cs.RO

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

VE2VF: 基于真实世界强化学习的视觉使能到无视觉蒸馏用于鲁棒接触丰富操作

Victor Kowalski, Chengxi Li, Dongheui Lee

发表机构 * Autonomous Systems, Technische Universitaet Wien (TU Wien)(自动系统,维也纳技术大学) Institute of Robotics and Mechatronics (DLR)(机器人与机电研究所)

AI总结 提出一种人在环强化学习框架,通过教师-学生蒸馏将视觉使能策略的知识迁移到仅依赖本体感知的无视觉策略,在真实世界训练中实现鲁棒泛化,无需域随机化或数据增强。

详情
AI中文摘要

当使用强化学习进行接触丰富的机器人操作时,视觉可以提供任务相关信息,加速学习,超越仅靠本体感知所能达到的效果。然而,视觉使能策略容易过拟合训练时看到的视觉条件,限制了其鲁棒性和可迁移性。我们提出一种人在环强化学习框架,采用教师-学生蒸馏,在完全真实世界训练中实现跨多个任务变体的鲁棒性能,无需域随机化或数据增强。视觉使能教师将其知识蒸馏到仅依赖位姿、扭转和力传感的无视觉学生中,结合了快速训练与强任务泛化。在真实世界的NIST装配基准板上,我们的方法在3个代表性任务上经过约50分钟训练后达到95%的整体成功率,包括对8个未见任务变体的鲁棒泛化。通过蒸馏微调在最困难的任务上实现了完全成功。我们证明所得策略在鲁棒性和适应性上均优于基线。

英文摘要

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.

2605.29562 2026-05-29 cs.RO cs.AI cs.CV

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

VLA-Pro:面向视觉-语言-动作模型的跨任务程序性记忆迁移

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) Shanghai Xinzhi Embodied Intelligence Technology Co., Ltd.(上海新智具身智能技术有限公司)

AI总结 提出VLA-Pro框架,通过存储和检索任务相关的LoRA适配器作为程序性记忆,实现跨任务泛化,在仿真和真实任务中成功率显著提升。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在通用机器人操作中展现出强大潜力,但在泛化到需要跨物体、场景和动作模式迁移相关经验的新任务时仍面临挑战。本文提出VLA-Pro,一种即插即用框架,通过在训练时存储任务相关的程序性记忆并在推理时迁移这些记忆来增强跨任务泛化。具体而言,VLA-Pro在训练时将任务特定的LoRA适配器存储为参数化的程序性记忆。在推理时,VLA-Pro基于当前多模态上下文检索相关程序性记忆,并动态融合这些记忆以生成当前动作块。在RoboTwin、RLBench和真实世界操作任务上的实验表明,VLA-Pro在多个骨干网络上持续提升跨任务泛化能力,在仿真中实现高达207%的相对改进,并将真实世界成功率从5.8%提升至65.0%。这些结果表明,程序性记忆检索与自适应为将操作经验迁移到新任务提供了一种有效机制,同时保持了模块化和执行稳定性。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
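
The retrieve-then-fuse idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: each memory is modeled as a (key embedding, flattened adapter update) pair, retrieval uses cosine similarity with top-k selection, and fusion is a softmax-weighted average. The names `retrieve_and_fuse`, `top_k`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_and_fuse(context, memory_bank, top_k=2, temperature=0.1):
    """Pick the top_k most similar memories and blend their adapter updates.

    memory_bank: list of (key_vector, adapter_delta) pairs, where adapter_delta
    is a flat list standing in for a task-specific LoRA update (assumption).
    """
    scored = sorted(
        ((cosine(context, key), delta) for key, delta in memory_bank),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    best = max(score for score, _ in scored)
    weights = [math.exp((score - best) / temperature) for score, _ in scored]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(scored[0][1])
    fused = [
        sum(w * delta[i] for w, (_, delta) in zip(weights, scored))
        for i in range(dim)
    ]
    return fused, weights


bank = [
    ([1.0, 0.0], [1.0, 1.0]),    # memory from a closely related task
    ([0.0, 1.0], [-1.0, -1.0]),  # memory from an unrelated task
    ([1.0, 0.1], [3.0, 3.0]),    # memory from a similar task
]
fused, weights = retrieve_and_fuse([1.0, 0.0], bank)
```

With `top_k=2` the unrelated memory is dropped entirely, and the fused update lies between the two retained adapter updates.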

2605.29561 2026-05-29 cs.AI cs.SE

ParaTool: Shifting Tool Representations from Context to Parameters

ParaTool: 将工具表示从上下文转移到参数中

Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出ParaTool框架,通过将每个工具投影为可加载的参数集,结合参数化工具预训练、软工具选择和参数化工具微调三个阶段,使大语言模型无需上下文文档即可进行工具调用,在Stable ToolBench和BFCL上显著优于基于ICL的基线方法。

详情
AI中文摘要

工具调用通过使大语言模型(LLM)能够与外部可执行接口进行基于环境的交互,从而扩展了其能力。然而,主流的上下文学习(ICL)方法通常将详细的工具文档和使用示例直接纳入上下文中,这导致随着上下文长度的增长,推理开销显著增加,并且幻觉风险升高。相反,基于微调的方法虽然提高了通用工具调用能力,但往往无法有效内化先前见过的工具的特定细节,从而仍然依赖于上下文文档。为了解决这些限制,我们提出了ParaTool,一个将每个工具投影到专用的、可加载的参数集中的框架。通过配备这些参数化工具的动态集成,LLM可以在不依赖上下文文档或示例的情况下执行工具调用。具体来说,我们的方法包括三个阶段:(1)参数化工具预训练将不同工具的知识封装到独立的参数模块中;(2)软工具选择使用门控网络动态加权和聚合相关工具参数;(3)参数化工具微调联合更新工具参数以对齐训练和推理过程。在Stable ToolBench和BFCL上的实验表明,ParaTool显著优于基于ICL的强基线方法,在降低计算复杂度的同时实现了优越的性能。

英文摘要

Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environment-coupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. By equipping a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.

2605.29560 2026-05-29 cs.AI

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Battery-Sim-Agent: 利用LLM智能体进行电池逆参数估计

Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao, Shun Zheng, Weiqing Liu, Jiang Bian

发表机构 * Peking University(北京大学) Microsoft Research(微软研究院) Zhejiang University(浙江大学) Chalmers University of Technology(查尔姆斯理工大学)

AI总结 提出Battery-Sim-Agent框架,将电池逆参数估计重构为推理任务,利用LLM智能体与高保真模拟器闭环交互,通过物理假设和结构化参数更新,显著优于贝叶斯优化等黑箱优化方法。

详情
AI中文摘要

对高保真电池“数字孪生”进行参数化是一个关键但具有挑战性的逆问题,阻碍了电池创新的步伐。现有方法将此表述为黑箱优化(BBO)任务,采用样本效率低且忽视底层物理的算法。在这项工作中,我们引入了一种新范式,将逆问题重新定义为推理任务,并提出了Battery-Sim-Agent,这是第一个将大型语言模型(LLM)智能体与高保真电池模拟器闭环部署的框架。该智能体模仿人类科学家的工作流程:它解释来自模拟器的丰富多模态反馈,形成基于物理的假设来解释差异,并提出结构化的参数更新。在一个系统构建的基准套件上,涵盖多种电池化学成分、操作条件和难度级别,我们的智能体在识别准确参数方面显著优于贝叶斯优化等强BBO基线。我们进一步展示了该框架在复杂长期退化拟合任务中的能力,并在真实电池数据集上验证了其实用性。我们的结果突显了LLM智能体作为基于推理的优化器在科学发现和电池参数估计中的前景。

英文摘要

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.

2605.29559 2026-05-29 cs.CL

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

LiteCoder-Terminal: 扩展用于学习语言代理的长时程终端环境

Xiaoxuan Peng, Kaiqi Zhang, Xinyu Lu, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出零依赖合成框架LiteCoder-Terminal-Gen,自动生成可执行且可验证的终端训练环境,构建大规模SFT和RL数据集,通过监督微调和直接多轮偏好优化显著提升语言代理在终端任务上的性能。

详情
AI中文摘要

掌握终端环境需要语言代理具备多步规划、基于反馈的执行和动态状态适应能力。然而,当前训练此类代理的瓶颈在于依赖从外部仓库抓取的数据,这限制了领域多样性、环境可控性以及针对特定能力缺陷的优化。我们引入了LiteCoder-Terminal-Gen,一个零依赖的合成流水线,能够直接从领域规范自动生成可执行且可验证的终端训练环境。利用该框架,我们构建了两个大规模资源:LiteCoder-Terminal-SFT,包含跨10个领域的11,255条专家轨迹;以及LiteCoder-Terminal-RL,包含602个可验证环境,用于轨迹级偏好优化。在SFT数据集上对Qwen系列模型进行监督微调,得到的代理显著优于其基础版本。值得注意的是,我们的32B变体在Terminal Bench 1.0、2.0和Pro上分别达到了29.06%、18.54%和34.00%的pass@1。此外,在RL环境上应用直接多轮偏好优化(DMPO)进一步提升了性能。这些结果系统性地表明,完全合成的可执行环境为掌握复杂的真实命令行工作流提供了可扩展且可验证的监督信号。

英文摘要

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.

2605.29558 2026-05-29 cs.CV

TAE: Target-aware enhancer for nighttime UAV tracking

TAE:面向夜间无人机跟踪的目标感知增强器

Yanyan Chen, Ruigang Fu, Yu Song, Ping Zhong

发表机构 * College of Electrical Science and Technology(电子科学与技术学院) National University of Defense Technology(国防科技大学)

AI总结 提出一种目标感知的低光增强框架TAE,利用跟踪框弱监督信号进行区域感知增强和自适应RGB多曲线融合,显著提升夜间无人机跟踪性能,并贡献了包含268个序列的DarkSOT基准。

Comments Accepted at ICIP 2026. Dataset is available at: https://github.com/Fu0511/DarkSOT-Dataset

详情
AI中文摘要

夜间低光条件下的严重图像退化是基于无人机的单目标跟踪全天候应用的核心瓶颈。现有的图像增强方法通常难以区分目标和背景区域,容易放大背景噪声或损害目标特征。为克服这一限制,我们提出TAE,一种专为夜间目标跟踪设计的目标感知低光增强框架。在跟踪边界框的弱监督信号显式引导下,该框架进行区域感知增强,确保操作聚焦于目标区域。它进一步采用自适应RGB多曲线融合机制,实现不同区域的精细建模和自适应调整。为促进该领域研究,我们还贡献了DarkSOT,一个新的夜间无人机跟踪基准,包含9个目标类别的268个序列。在DarkSOT和UAVDark135上的实验结果表明,TAE显著提升了低光夜间场景下的跟踪性能,展现出强鲁棒性和泛化能力。DarkSOT数据集可在https://github.com/Fu0511/DarkSOT-Dataset获取。

英文摘要

Severe image degradation under low-light nighttime conditions constitutes a core bottleneck preventing all-day applications for UAV-based single object tracking. Existing image enhancement methods often struggle to distinguish between target and background regions, which can easily lead to amplified background noise or compromise target features. To overcome this limitation, we propose TAE, a target-aware low-light enhancement framework tailored for nighttime object tracking. Guided explicitly by weak supervisory signals from tracking bounding boxes, the framework performs region-aware enhancement to ensure operations focus on the target area. It further adopts an adaptive RGB multi-curve fusion mechanism to achieve refined modeling and adaptive adjustment across different regions. To facilitate research in this domain, we also contribute DarkSOT, a new benchmark for nighttime UAV tracking, comprising 268 sequences across 9 target categories. Experimental results on the DarkSOT and UAVDark135 demonstrate that TAE significantly improves tracking performance in low-light nighttime scenarios, exhibiting strong robustness and generalization. The DarkSOT dataset is available at https://github.com/Fu0511/DarkSOT-Dataset.

2605.29556 2026-05-29 cs.AI

Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification

Opt-Verifier:通过双面验证释放大语言模型在优化建模中的潜力

Haoyang Liu, Jie Wang, Boxuan Niu, Xiongwei Han, Yian Xu, Mingxuan Ye, Zijie Geng, Fangzhou Zhu, Tao Zhong, Mingxuan Yuan, Jianye Hao

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition(类脑智能感知与认知教育部重点实验室) University of Science and Technology of China(中国科学技术大学) Noah's Ark Lab, Huawei Technologies(华为诺亚方舟实验室) Tianjin University(天津大学)

AI总结 提出Opt-Verifier框架,通过结构侧和解决方案侧的双面验证,利用大语言模型自动构建数学优化模型,显著提升建模准确性。

Journal ref International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

构建数学优化模型在运筹学中至关重要,但需要大量人类专业知识。最近的进展利用大语言模型(LLMs)来自动化这一建模过程。然而,现有工作往往难以验证生成的优化模型的正确性,既不检查约束和变量的合理性,也不检查生成模型解的有效性。这阻碍了后续的验证和纠正步骤,从而严重损害了建模准确性。为了解决这一挑战,我们提出了一种新颖的基于LLM的框架,具有从结构和解决方案两个角度的双面验证(Opt-Verifier),从而提高建模准确性。结构侧验证确保生成的优化模型的建模结构与原始问题描述一致,准确捕捉问题的约束和要求。同时,解决方案侧验证解释和评估解的有效性,确认优化模型在逻辑和数学上是合理的。在流行基准上的实验表明,我们的方法在准确性上提高了20%以上。

英文摘要

Building mathematical optimization models is critical in operations research (OR), but it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models: they check neither the rationality of the constraints and variables nor the validity of solutions to the generated models. This hampers the subsequent verification and correction steps and severely hurts modeling accuracy. To address this challenge, we propose a novel LLM-based framework with Dual-side Verification (Opt-Verifier) from both structure and solution perspectives, thereby improving the modeling accuracy. The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the solutions' validity, confirming that the optimization models are logically and mathematically sound. Experiments on popular benchmarks demonstrate that our approach achieves over 20% improvement in accuracy.

2605.29555 2026-05-29 cs.CL

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

从盲目猜测到知情判断:通过构建知识增强的偏好信号教会LLM评估材料

Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian

发表机构 * School of Computer Engineering & Science, Shanghai University(上海大学计算机工程与科学学院) Center of Materials Informatics and Data Science, Materials Genome Institute, Shanghai University(上海大学材料基因组工程研究院材料信息与数据科学中心) Key Laboratory of Silicate Cultural Relics Conservation (Shanghai University), Ministry of Education, China(硅酸盐质文物保护教育部重点实验室(上海大学)) Shanghai Institute for Advanced Communication and Data Science, Shanghai University(上海大学上海先进通信与数据科学研究院)

AI总结 提出知识增强偏好信号框架MaterEval,通过成对偏好数据引导大语言模型从直觉判断转向基于证据的可靠评估,并引入快慢推理方案平衡吞吐量、成本和可靠性,在高熵合金评估中验证了有效性。

Comments 33 pages, 5 figures

详情
AI中文摘要

随着候选生成和高通量实验的进步,材料发现的主要瓶颈正从性质预测转向在大量候选集中进行可靠评估。我们提出了知识增强偏好信号框架MaterEval,该框架自动为同一候选生成两种评估:一种遵循专家规则并提供支持证据的知情判断,另一种是移除规则的盲目猜测。通过将这两种评估配对作为偏好数据,我们引导原本缺乏材料特定标准的通用大语言模型(LLM)从直觉判断转向由明确证据支持的可靠评估。为了平衡吞吐量、成本和可靠性,我们进一步引入了一种快慢推理方案,将大规模快速筛选与对小子集的深入审查解耦。以高熵合金(HEA)评估为例,我们表明,无需外部检索,仅依赖内化能力,小型开源LLM在准确性、结论一致性和证据区分度上取得了显著提升,接近基于规则的闭源LLM的性能。这些结果表明,专家规则可以系统地转化为可学习的偏好信号,从而为自主材料发现循环提供低成本且可部署的评估模块。

英文摘要

As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowledge-Augmented Preference Signals Framework, MaterEval, that automatically produces, for the same candidate, two evaluations: an informed judgment that follows expert rules and provides supporting evidence, and a rule-removed blind guess. By pairing the two evaluations as preference data, we guide general-purpose large language models (LLMs), originally lacking materials-specific criteria, from intuitive judgment toward reliable evaluation supported by explicit evidence. To balance throughput, cost, and reliability, we further introduce a fast-slow reasoning scheme that decouples large-scale rapid screening from in-depth review on a small subset. Using high-entropy alloy (HEA) assessment as a case study, we show that, without external retrieval and relying solely on internalized capabilities, small open-source LLMs achieve substantial gains in accuracy, conclusion consistency, and evidence discrimination, approaching the performance of rule-based closed-source LLMs. These results demonstrate that expert rules can be systematically transformed into learnable preference signals, enabling a low-cost and deployable evaluation module for autonomous materials discovery loops.

2605.29549 2026-05-29 cs.CV

Learning Representations from 3D Gaussian Splats

从3D高斯溅射中学习表示

Julia Farganus, Krzysztof Żurawicki, Arkadiusz Gaweł, Weronika Jakubowska, Halina Kwaśnicka

发表机构 * Department of Artificial Intelligence, Wrocław University of Science

AI总结 本研究通过比较多种几何深度学习架构,评估了基于3D高斯溅射的场景表示在分类任务中的有效性,揭示了不同架构和输入特征对表示质量的影响。

Comments 5 figures, 15 pages

详情
AI中文摘要

3D高斯溅射(3DGS)是一种用于场景渲染的最新方法。尽管其主要设计用于视图合成,但其在场景理解任务中的潜力尚未得到充分探索。在这项工作中,我们对使用高斯溅射表示的3D场景分类的各种几何深度学习架构进行了比较评估。我们在传统点云数据集和专用高斯溅射数据集上对基于点和基于图的模型进行了基准测试。场景被嵌入到潜在表示中,并通过端到端分类、线性探测和聚类分析进行评估。我们的研究为不同几何感知架构和输入特征配置在学习有效3D高斯溅射表示方面的适用性提供了见解。结果突出了架构家族之间的一致差异,并揭示了高斯特定属性对表示质量的影响。

英文摘要

3D Gaussian Splatting (3DGS) is a recent approach for scene rendering. Although primarily designed for view synthesis, its potential for scene understanding tasks remains underexplored. In this work, we conduct a comparative evaluation of various geometric deep learning architectures for the classification of 3D scenes represented using Gaussian Splatting. We benchmark point-based and graph-based models across both traditional point cloud datasets and dedicated Gaussian Splatting datasets. Scenes are embedded into latent representations, which are evaluated through end-to-end classification, linear probing, and clustering analysis. Our study provides insight into the suitability of different geometry-aware architectures and input feature configurations for learning effective 3D Gaussian Splat representations. The results highlight consistent differences between architectural families and reveal the impact of Gaussian-specific attributes on the quality of representation.

2605.29547 2026-05-29 cs.LG cs.AI math.OC

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

基于随机几何探测的奇异性感知优化:迈向稳定的非光滑优化

Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang

发表机构 * Xi'an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 针对非光滑优化中Adam优化器的梯度抖动问题,提出奇异性感知Adam(S-Adam),通过局部几何不稳定性(LGI)度量动态调整步长,实现稳定训练并提升泛化性能。

Comments International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

深度学习优化严重依赖于损失景观平滑的假设,而现代架构由于ReLU激活和量化算子等非光滑组件系统性地违反了这一条件。在这种非光滑情况下,Adam等自适应优化器会出现梯度抖动,即由Clarke次微分内冲突信号引起的剧烈振荡,导致收敛性差和泛化能力欠佳。为解决此问题,我们引入了奇异性感知Adam(S-Adam),一种通过基于局部几何不稳定性动态调整步长来稳定训练的新型优化器。我们的关键贡献是局部几何不稳定性(LGI)度量,一种从随机方向导数方差导出的Clarke次微分直径的计算高效估计量。S-Adam采用自适应阻尼机制 $\exp(-\lambda\rho)$,在高不稳定性区域减缓更新,同时在平滑盆地保持快速收敛。我们使用微分包含提供了严格的收敛性分析,证明S-Adam以最优的 $O(1/\sqrt{T})$ 速率几乎必然收敛到 $(\delta,\epsilon)$-Clarke稳定点。在量化感知训练(QAT)和高噪声小批量学习上的实证评估表明,S-Adam持续优于AdamW和Prox-SGD,在CIFAR-100上实现高达6%的准确率提升,在TinyImageNet上实现3%的提升,同时有效缓解梯度振荡。

英文摘要

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\epsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.
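
The damping factor exp(-λρ) from the abstract can be illustrated with a small Python sketch. The abstract does not give the LGI formula, so `lgi_proxy` below is only one plausible stand-in and is an assumption: it probes random unit directions in antithetic pairs, compares each one-sided finite-difference directional derivative with the linear prediction from the current gradient, and returns the variance of the mismatch. The mismatch vanishes where the function is smooth and does not vanish at a kink. The names `lgi_proxy`, `damped_step_size`, `lam`, `h`, and `num_probes` are hypothetical.

```python
import math
import random


def lgi_proxy(f, x, grad, num_probes=8, h=1e-4, rng=None):
    """Toy instability score rho >= 0 (assumption, not the paper's estimator)."""
    rng = rng or random.Random(0)
    fx = f(x)
    residuals = []
    for _ in range(num_probes):
        raw = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in raw))
        if norm == 0.0:
            continue
        for sign in (1.0, -1.0):  # antithetic pair: probe u and -u
            u = [sign * c / norm for c in raw]
            shifted = [xi + h * ui for xi, ui in zip(x, u)]
            one_sided = (f(shifted) - fx) / h
            linear = sum(g * ui for g, ui in zip(grad, u))
            residuals.append(one_sided - linear)
    if not residuals:
        return 0.0
    mean_r = sum(residuals) / len(residuals)
    return sum((r - mean_r) ** 2 for r in residuals) / len(residuals)


def damped_step_size(base_lr, rho, lam=1.0):
    """Scale the step size by exp(-lam * rho), as described in the abstract."""
    return base_lr * math.exp(-lam * rho)


def abs_loss(x):
    return abs(x[0])


rho_smooth = lgi_proxy(abs_loss, [1.0], grad=[1.0])  # away from the kink
rho_kink = lgi_proxy(abs_loss, [0.0], grad=[1.0])    # at the kink of |x|
lr_smooth = damped_step_size(0.1, rho_smooth)
lr_kink = damped_step_size(0.1, rho_kink)
```

On this toy problem the step size is left essentially unchanged at the smooth point and shrinks at the kink, which is the qualitative behavior the abstract attributes to the damping term.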

2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE:一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

发表机构 * Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology(香港科技大学机械及航空航天工程学系) School of Electronic and Information Engineering, Beihang University(北京航空航天大学电子信息工程学院) State Key Laboratory of CNS/ATM(空管新航行系统(CNS/ATM)国家重点实验室)

AI总结 提出SCOPE框架,通过冻结LLM结合插件式开放集分类器和上下文学习机制,实现高效准确的空管复诵监控,在少样本设置下开放集检测准确率达91.05%,异常纠正率96.63%。

详情
AI中文摘要

飞行员对空中交通管制(ATC)语音指令的复诵是航空运输中防止沟通失误的主要保障。然而,复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧,从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型(LLM)凭借其强大的推理和泛化能力开辟了新途径,但现有方法在实践中仍面临部署和计算障碍。在这项工作中,我们提出了SCOPE(Semantic reasoning for Communication via Open-set Plug-in with Examples),一种新颖的轻量训练LLM框架,提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上,将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明,SCOPE在实现运行环境所需的低延迟响应的同时,达到了优越的准确性。在少样本设置下,SCOPE在开放集检测中达到91.05%的准确率,并纠正了96.63%的异常复诵,从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

2605.29538 2026-05-29 cs.CV

RadioFormer3D: Weakly Supervised 3D Radio Map Estimation in Low-Altitude Airspace via Generative Modeling

RadioFormer3D:通过生成式建模在低空空域中进行弱监督三维无线电地图估计

Zheng Fang, Junjie Liu, Kangjun Liu, Jianguo Zhang, Yaowei Wang, Ke Chen

发表机构 * Pengcheng Laboratory(鹏城实验室) Southern University of Science and Technology(南方科技大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出RadioFormer3D模型,采用傅里叶采样编码器、体素解码器和联合频谱完整性损失,在弱监督下实现三维空间稀疏测量的无线电地图估计,有效提升未标注高度层的重建质量。

详情
AI中文摘要

随着三维环境中无线应用(如低空空域和三维异构网络)的出现,无线电地图估计越来越需要表征信号在水平和垂直维度上的传播。然而,由于空间稀疏性增加和连续高度上的监督有限,将无线电地图估计从二维扩展到三维仍然具有挑战性。在本文中,我们提出了RadioFormer3D,一种专门用于弱监督下体素频谱重建的模型。基于RadioFormer的双流多粒度融合架构,RadioFormer3D引入了基于傅里叶的采样编码器和体素解码器,以有效处理三维空间中的稀疏测量。为了缓解垂直监督的缺乏,我们提出了联合频谱完整性损失(Joint Spectrum Integrity Loss),它将体素级伪标签监督、地图级几何感知无线电渲染和像素级局部约束整合到一个统一的优化方案中。这种设计使模型能够在稀疏监督下更有效地捕捉复杂的垂直结构关系。在多个无线电地图数据集上的大量实验表明,与现有代表性方法相比,RadioFormer3D实现了优越的整体性能。特别是,它在保持精度和推理效率之间良好权衡的同时,在未标注高度层上展示了改进的重建质量,使其成为未来三维环境感知无线网络的一个非常有前景的解决方案。

英文摘要

With the emergence of wireless applications in three-dimensional environments, such as the low-altitude airspace and 3D heterogeneous networks, radio map estimation is increasingly required to characterize signal propagation across both horizontal and vertical dimensions. However, extending radio map estimation from 2D to 3D remains challenging due to increased spatial sparsity and limited supervision across continuous altitudes. In this paper, we propose RadioFormer3D, a specialized model for volumetric spectrum reconstruction under weak supervision. Building on the dual-stream, multi-granularity fusion architecture of RadioFormer, RadioFormer3D introduces a Fourier-based sampling encoder and a volumetric decoder to efficiently process sparse measurements in 3D space. To alleviate the lack of vertical supervision, we propose the Joint Spectrum Integrity Loss, which integrates volume-level pseudo-label supervision, map-level geometry-aware radio rendering, and pixel-level localized constraints within a unified optimization scheme. This design enables the model to capture complex vertical structural relationships more effectively under sparse supervision. Extensive experiments across several radio map datasets show that RadioFormer3D achieves superior overall performance compared to representative existing methods. In particular, it demonstrates improved reconstruction quality at unlabeled altitudes while maintaining a favorable trade-off between accuracy and inference efficiency, positioning it as a highly promising solution for future 3D environment-aware wireless networks.

2605.29535 2026-05-29 cs.LG

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

AsymVLM:面向高效视觉-语言模型推理的非对称令牌剪枝

Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 针对视觉和文本令牌在预填充与解码阶段的不同特性,提出非对称剪枝方法AsymVLM,通过视觉令牌的激进剪枝和文本令牌的基于阈值的驱逐,实现高达54%的FLOPs节省并在文档和图表理解任务上提升2-3%的准确率。

详情
AI中文摘要

视觉-语言模型(VLM)每张图像处理数千个视觉令牌,而文本令牌相对较少,但现有压缩方法对两种模态一视同仁。我们观察到两种模态具有根本不同的特性:视觉令牌在空间上冗余且主导预填充阶段,而文本令牌具有因果依赖性并在解码过程中累积。基于这种非对称性,我们提出并实证评估了AsymVLM,该方法在预填充前使用学习的重要性评分器结合每样本自适应预算对视觉令牌进行激进剪枝,并仅在文本令牌超过固定预算时执行基于时间阈值的驱逐。实验表明,AsymVLM在现有方法中实现了最高的FLOPs节省(高达54%),同时在视觉信息空间局部化且与查询相关的文档和图表理解任务上,比现有方法提升2-3%的准确率,并在整体基准上保持竞争性精度。在文本主导的场景中,我们的驱逐策略通过适应VLM的短上下文特性,显著优于标准的LLM缓存压缩方法。

英文摘要

Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamentally different properties: vision tokens are spatially redundant and dominate prefill, while text tokens are causally dependent and accumulate during decoding. Based on this asymmetry, we propose and empirically evaluate AsymVLM, which applies aggressive pruning to vision tokens before prefill using a learned importance scorer with per-sample adaptive budgeting, and temporal threshold-based eviction to text tokens only when they exceed a fixed budget. Our experiments indicate that AsymVLM achieves the highest FLOPs savings (up to 54%) among state-of-the-art methods while outperforming existing approaches by 2--3% on document and chart understanding tasks where visual information is spatially localized and query-specific, and maintaining competitive accuracy on holistic benchmarks. In text-dominated scenarios, our eviction strategy substantially outperforms standard LLM cache compression methods by adapting to the short-context nature of VLM.
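
The asymmetric treatment of the two modalities can be sketched in Python as follows. The abstract does not specify the budgeting rule or the eviction criterion, so both are assumptions here: the vision budget adapts per sample by keeping the highest-scoring tokens until they cover a fraction `mass` of the total importance, and text eviction simply drops the oldest cached tokens once the cache exceeds `budget`. The function and parameter names are hypothetical, and the scores would come from a learned scorer that is not modeled.

```python
def prune_vision_tokens(scores, mass=0.8, min_keep=1):
    """Keep the highest-scoring vision tokens until they cover `mass` of the
    total importance. Returns kept indices in their original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = sum(scores)
    kept, covered = [], 0.0
    for i in order:
        if len(kept) >= min_keep and covered >= mass * total:
            break
        kept.append(i)
        covered += scores[i]
    return sorted(kept)


def evict_text_tokens(cache, budget):
    """Leave the text cache untouched until it exceeds `budget`, then keep
    only the most recent `budget` entries (age-based eviction, an assumption)."""
    if len(cache) <= budget:
        return list(cache)
    return list(cache[len(cache) - budget:])


# Two tokens carry most of the importance, so only they survive pruning.
kept_vision = prune_vision_tokens([8.0, 1.0, 6.0, 1.0], mass=0.8)

short_cache = evict_text_tokens(["a", "b", "c"], budget=4)
long_cache = evict_text_tokens(list(range(10)), budget=4)
```

Because the vision budget depends on how concentrated the scores are, an image with a few dominant regions keeps fewer tokens than one whose importance is spread evenly.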

2605.29534 2026-05-29 cs.AI

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

UI-KOBE:面向轻量级图引导GUI代理的知识导向行为探索

Yuxiang Chai, Han Xiao, Xinyu Fu, Jinpeng Chen, Rui Liu, Hongsheng Li

Affiliations: CUHK MMLab; Huawei Research; Shenzhen Loop Area Institute; CPII under InnoHK

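AI summary: Proposes the UI-KOBE framework, which automatically builds an app knowledge graph and uses it to guide a lightweight GUI agent's runtime decisions, improving its execution of mobile GUI tasks.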

详情
AI中文摘要

近期移动GUI代理的进展显示出自动化移动任务的强大潜力,但大多数有效系统仍依赖大型视觉语言模型进行截图理解和长期规划。可直接部署在移动设备上的小型GUI代理在实际应用中更具吸引力,具有更低的推理成本和更好的敏感设备信息保护。然而,由于模型容量有限,这些轻量级代理在仅凭截图端到端规划和执行GUI任务时仍不可靠。我们提出知识导向行为探索(UI-KOBE),一种利用可复用的应用特定图知识来改进轻量级移动GUI代理的框架。UI-KOBE首先自主探索移动应用并构建应用知识图谱,其中节点代表不同的UI状态,边代表可执行的转换。运行时,轻量级GUI代理将图作为外部指导:给定用户任务和当前截图,它识别当前图节点,并选择与该节点关联的自循环动作、相邻转换、任务完成或回退自由动作。通过用应用特定的图指导支持运行时决策,UI-KOBE减轻了端到端GUI规划的负担,帮助轻量级模型更有效地执行移动GUI任务,为高效、可解释且注重隐私的设备端GUI代理提供了实用的一步。

英文摘要

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
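The graph-guided decision step described in the abstract can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `AppGraph`, the string-valued states and actions, and the `DONE` and `FREE_ACTION` placeholders are all hypothetical, and the step where a model identifies the current node from a screenshot is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AppGraph:
    """App knowledge graph: nodes are UI states, edges are executable
    transitions. An edge whose target equals its source is a self-loop."""
    edges: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def add_transition(self, state: str, action: str, next_state: str) -> None:
        self.edges.setdefault(state, []).append((action, next_state))
        # Make sure the target state is also registered as a node.
        self.edges.setdefault(next_state, [])

    def candidates(self, state: str) -> List[Tuple[str, str]]:
        """List the options offered to the agent at `state`:
        self-loop actions, neighboring transitions, task completion,
        and a fallback free action."""
        options: List[Tuple[str, str]] = []
        for action, nxt in self.edges.get(state, []):
            kind = "self_loop" if nxt == state else "transition"
            options.append((kind, action))
        options.append(("complete", "DONE"))
        options.append(("fallback", "FREE_ACTION"))
        return options


graph = AppGraph()
graph.add_transition("home", "tap_settings", "settings")
graph.add_transition("home", "scroll_down", "home")
for kind, action in graph.candidates("home"):
    print(kind, action)
```

Restricting the agent to a short candidate list per node is what would reduce the planning burden on a small model: it picks from a handful of known options instead of generating an action from the raw screenshot. An unknown state still yields the completion and fallback options.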