arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2412.14461 2026-05-13 cs.CL

To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation

Xiang Cheng, Raveesh Mayya, João Sedoc

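Affiliations Robert H. Smith School of Business, University of Maryland; Leonard N. Stern School of Business, New York University

AI summary This paper examines the reproducibility challenges of text annotation with large language models (LLMs), in particular how to keep annotation results reproducible over the long term once the original model has been deprecated. It proposes an analytical framework, SILICON, that decomposes measurement error into four sources and designs targeted interventions to reduce each. Experiments show that the approach improves annotation quality, and a regression-based method identifies open-weight backup models that remain available long term, giving annotation tasks in management research a more robust footing.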

Abstract

Unstructured text data annotation is foundational to management research. LLMs offer a cost-effective and scalable alternative to human annotation, but they introduce a novel challenge: the annotator itself can be retired. Proprietary models undergo regular deprecation cycles, threatening long-term reproducibility. Hence, the ability to reproduce annotation results when the original model becomes unavailable, i.e., robust reproducibility, is a central methodological challenge for LLM-based annotation. Achieving robust reproducibility requires first controlling measurement error. We develop an analytical framework that decomposes measurement error into four sources: guideline-induced error from inconsistent annotation criteria, baseline-induced error from unreliable human references, prompt-induced error from suboptimal meta-instruction, and model-induced error from architectural differences across LLMs. We develop the SILICON workflow that instantiates the analytical framework, prescribing targeted interventions at each error source. Empirical validation across nine management research tasks confirms that these interventions reduce measurement error, and simulations show that the resulting error reduction yields more accurate downstream statistical estimates. With measurement error controlled, we address two further aspects of robust reproducibility. First, we propose a regression-based methodology to establish backup open-weight models, which are permanently accessible. Every tested task has at least one open-weight model with no statistically detectable performance difference. Second, we quantify the upper bound of annotation quality attainable from the current set of available models by proposing a routing procedure that selectively sends low-confidence items to auxiliary models, revealing when model aggregation improves performance and when that may adversely affect labeling quality.
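The routing idea mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not describe the actual procedure, so everything here is an assumption for illustration: each primary annotation is taken to come with a confidence score in [0, 1], the function name `route_annotations` and the threshold of 0.8 are hypothetical, and majority voting stands in for whatever aggregation the authors use.

```python
from collections import Counter


def route_annotations(primary, auxiliaries, threshold=0.8):
    """Keep confident primary labels; send the rest to auxiliary models.

    primary:     list of (label, confidence) pairs from the primary model.
    auxiliaries: list of label lists, one list per auxiliary model,
                 aligned item by item with `primary`.
    threshold:   hypothetical confidence cutoff below which an item is routed.
    """
    final = []
    for i, (label, confidence) in enumerate(primary):
        if confidence >= threshold:
            # High-confidence item: the primary label is kept as is.
            final.append(label)
            continue
        # Low-confidence item: majority vote over primary + auxiliary labels.
        votes = Counter([label] + [aux[i] for aux in auxiliaries])
        final.append(votes.most_common(1)[0][0])
    return final
```

Comparing the routed labels against a human reference, item by item, is one way to see the effect the abstract points to: aggregation can help on some low-confidence items and hurt on others.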

2412.13050 2026-05-13 cs.LG cs.AI cs.CL cs.CV cs.SD eess.AS

Modality-Inconsistent Continual Learning of Multimodal Large Language Models

Weiguo Pian, Shijian Deng, Shentong Mo, Mingrui Liu, Yunhui Guo, Yapeng Tian

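Affiliations The University of Texas at Dallas; Carnegie Mellon University; George Mason University

AI summary This paper introduces Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for multimodal large language models in which tasks involve inconsistent modalities (image, audio, or video) and different task types (captioning or question answering). To counter the catastrophic forgetting caused by shifts in modality and task type, it proposes MoInCL, which combines a Pseudo Targets Generation Module with instruction-based knowledge distillation. Experiments show that MoInCL outperforms existing continual learning methods across multiple tasks.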

Comments Accepted at Transactions on Machine Learning Research (TMLR), 2026

Abstract

In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.

2411.19517 2026-05-13 cs.LG cs.AI

RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs

Tae-Hoon Lee, Min-Soo Kim

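Affiliations School of Computing, KAIST, Daejeon, Republic of Korea

AI summary This study proposes RL-SPH, a reinforcement learning-based start primal heuristic that quickly generates feasible solutions for integer linear programs. Unlike existing methods, RL-SPH can generate high-quality feasible solutions on its own, including for problems with non-binary integers. Experiments show that RL-SPH outperforms existing methods on feasibility rate, primal gap, and primal integral.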

Comments Accepted at ICML 2026. 30 pages, 12 figures, 22 tables

Abstract

Primal heuristics play a crucial role in quickly finding feasible solutions for NP-hard integer linear programming (ILP). Although $\textit{end-to-end learning}$-based primal heuristics (E2EPH) have recently been proposed, they are typically unable to independently generate feasible solutions. To address this challenge, we propose RL-SPH, a novel reinforcement learning-based start primal heuristic capable of independently generating feasible solutions, even for ILP involving non-binary integers. Empirically, RL-SPH rapidly obtains high-quality feasible solutions with a 100% feasibility rate, achieving on average a 28.6$\times$ lower primal gap and a 2.6$\times$ lower primal integral compared to existing start primal heuristics.
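For readers unfamiliar with the two metrics, the definitions commonly used in the primal heuristics literature are sketched below. The abstract does not state them, so treating these as the paper's exact definitions is an assumption. Here $\tilde{x}$ is the incumbent solution, $x^*$ an optimal or best-known solution, $c$ the objective vector, and $p(t)$ the primal gap of the incumbent at time $t$ (taken as 1 before any feasible solution is found).

```latex
\gamma(\tilde{x}) =
\begin{cases}
0, & \text{if } |c^\top \tilde{x}| = |c^\top x^*| = 0,\\[2pt]
1, & \text{if } (c^\top \tilde{x})\,(c^\top x^*) < 0,\\[2pt]
\dfrac{|c^\top \tilde{x} - c^\top x^*|}{\max\{|c^\top \tilde{x}|,\ |c^\top x^*|\}}, & \text{otherwise,}
\end{cases}
\qquad
P(T) = \int_0^T p(t)\,dt .
```

Under these definitions a lower primal gap means the solution is closer to the optimum, and a lower primal integral means good solutions are found earlier in the run.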

2405.10271 2026-05-13 cs.LG cs.AI cs.DC cs.ET

Pruning Federated Models through Loss Landscape Analysis and Client Agreement Scoring

Christian Internò, Elena Raponi, Markus Olhofer, Ali Raza, Thomas Bäck, Niki van Stein, Yaochu Jin, Barbara Hammer

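Affiliations Bielefeld University; School of Engineering, Westlake University

AI summary To address the high cost of training large models and the instability caused by data heterogeneity when deploying federated learning on resource-constrained devices, this paper proposes AutoFLIP, an automated model pruning framework based on loss landscape analysis and client agreement scoring. The method treats client data diversity as a feature to exploit: a one-time federated loss exploration builds a map of the global loss landscape, which then guides a dynamic pruning strategy. Experiments in non-IID settings show an average 52% reduction in computational overhead and a reduction of over 65% in communication cost while maintaining state-of-the-art accuracy.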

Journal ref Published in IEEE Internet of Things Journal, 2026

Abstract

The practical deployment of Federated Learning (FL) on resource-constrained devices is fundamentally limited by the high cost of training large models and the instability caused by heterogeneous (non-IID) client data. Conventional pruning methods often treat data heterogeneity as a problem to be mitigated. In this work, we introduce a paradigm shift: we reframe client diversity as a feature to be harnessed. We propose AutoFLIP, a framework that begins not with training, but with a one-time federated loss exploration. During this phase, clients collaboratively build a map of the collective loss landscape, using their diverse data to reveal the problem's essential structure. This shared intelligence then guides an adaptive pruning strategy that is dynamically refined by client agreement throughout training. This approach allows AutoFLIP to identify robust and efficient sub-networks from the outset. Our extensive experiments show that AutoFLIP reduces computational overhead by an average of 52% and communication costs by over 65% while simultaneously achieving state-of-the-art accuracy in challenging non-IID settings.

2212.02011 2026-05-13 cs.CV

PointCaM: Cut-and-Mix for Open-Set Point Cloud Learning

Jie Hong, Shi Qiu, Weihao Li, Saeed Anwar, Mehrtash Harandi, Nick Barnes, Lars Petersson

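Affiliations The University of Hong Kong; The Chinese University of Hong Kong; Australian National University; The University of Western Australia; Monash University

AI summary This paper studies open-set point cloud learning, where the model is trained without data from unknown classes and must identify unknown objects at inference time. The authors propose PointCaM, a point cloud cut-and-mix mechanism with two modules, an Unknown-Point Simulator and an Unknown-Point Estimator, which simulates out-of-distribution data and uses multi-level feature context to separate known from unknown point clouds. Experiments on several datasets show improved open-set recognition performance.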

Comments Accepted in CVIU

Abstract

Point cloud learning is receiving increasing attention. However, most existing point cloud models lack the practical ability to deal with the unavoidable presence of unknown objects. This paper primarily discusses point cloud learning in open-set settings, where we train the model without data from unknown classes and identify them during the inference stage. In essence, we propose a novel Point Cut-and-Mix mechanism for solving open-set point cloud learning, comprising an Unknown-Point Simulator and an Unknown-Point Estimator module. Specifically, we use the Unknown-Point Simulator to simulate out-of-distribution data in the training stage by manipulating the geometric context of partially known data. Based on this, the Unknown-Point Estimator module learns to exploit the point cloud's feature context to discriminate between known and unknown data. Unlike existing methods that only consider classifier features, our proposed solution leverages multi-level feature contexts to recognize unknown point cloud objects more effectively. We test the proposed approach on several datasets, including customized S3DIS, ModelNet40, and ScanObjectNN. The improved open-set performances over comparative baselines show the effectiveness of our PointCaM method. Our code is available at https://github.com/JHome1/pointcam.

2605.11832 2026-05-13 cs.RO

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

Junjin Xiao, Dongyang Li, Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Feng Xiong, Mu Xu, Xing Wei, Zhiheng Ma, Qing Zhang, Wei-Shi Zheng

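Affiliations Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; AMap, Alibaba Group; Xi’an Jiaotong University; Shenzhen University of Advanced Technology

AI summary This paper addresses spatial perception and manipulation challenges in Vision-Language-Action (VLA) models and proposes an action manifold learning method built on multi-view latent priors. A pre-trained multi-view diffusion model synthesizes novel views, a Geometry-Guided Gated Transformer (G3T) aligns multi-view features and filters occlusion noise, and Action Manifold Learning (AML) directly predicts valid actions to make action learning more efficient. Experiments show the method outperforms state-of-the-art baselines on several benchmarks and real-robot tasks.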

Abstract

This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action learning efficiency, we introduce Action Manifold Learning (AML), which directly predicts actions on the valid action manifold, bypassing inefficient regression of unstructured targets like noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show our method achieves superior success rate and robustness over SOTA baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/.

2605.11825 2026-05-13 cs.RO

Mapping Embodied Affective Touch Strategies on a Humanoid Robot

Qiaoqiao Ren, Omar Eldardeer, Francesca Cocchella, Rea Francesco, Alessandra Sciutti, Tony Belpaeme

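Affiliations Faculty of Engineering and Architecture; IDLab-AIRO, Ghent University – imec; Italian Institute of Technology; School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology

AI summary This study examines affective touch strategies in human interaction with a humanoid robot, analyzing how touch expression is constrained by the robot's body structure and by touch location. In an experiment where 32 participants interacted with an iCub robot equipped with full-body tactile sensors, body region and spatial constraints significantly affected touch location and dynamics, and emotional expression differed across body parts and constraint conditions. The results indicate that affective touch strategies depend strongly on body region and are shaped by the robot's embodiment.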

Abstract

Affective touch in human-robot interaction is shaped not only by emotional intent, but also by robot embodiment, including touch location, physical constraints, and perceived agency or social role. Existing HRI studies typically focus on one or two isolated body parts, limiting understanding of how affective touch generalises across the full humanoid body. We present a study with 32 participants interacting with the iCub robot, which is equipped with full-body distributed tactile sensors. Participants expressed eight emotions under three conditions: free touch, arm-only touch, and torso-only touch. Results show that body region and spatial constraints jointly shaped both touch location and dynamics. In free touch, participants preferred socially accessible upper-body regions, while less frequently touched areas showed stronger emotion-specific selectivity. Emotion-related variation was more evident in motion features for arm-only touch and pressure features for torso-only touch. Touch strategies also did not transfer directly between free and constrained conditions, even within the same coarse body region. Participants reported increased closeness to the robot after interaction, with around 30 percent reporting a change in perceived social relationship. Together, these findings show that affective touch expression is strongly body-region dependent and shaped by embodiment constraints.

2605.11824 2026-05-13 cs.CV cs.AI

REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View

Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura

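Affiliations ElektroBit Automotive GmbH; Eindhoven University of Technology; Transilvania University of Brasov

AI summary This paper proposes REFNet++, a multi-task, efficient method for fusing camera and radar sensor data in the bird's-eye polar view. A variational encoder-decoder transforms camera images into the polar domain, while angle information is recovered from the radar range-Doppler spectrum to produce range-azimuth features, aligning both modalities in a unified domain. The method targets computational efficiency as well as fusion accuracy, and achieves better performance than existing methods on vehicle detection and free space segmentation.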

Comments IEEE Intelligent Transportation Systems Conference (ITSC) 2025

Abstract

A realistic view of the vehicle's surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.

2605.11818 2026-05-13 cs.CV

RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition

Binhao Wang, Shihao Zhao, Bo Cheng, Qiuyu Ji, Yuhang Ma, Liebucha Wu, Shanyuan Liu, Dawei Leng, Yuhui Yin

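Affiliations Wenzhou University; 360 AI Research Institute

AI summary This paper proposes RevealLayer, a diffusion-based image layer decomposition method that tackles the separation of hidden and visible layers in complex natural images and the recovery of occluded content. It introduces a Region-Aware Attention module, an Occlusion-Guided Adapter, and a composite loss for more precise layer separation and reconstruction of occluded content. The team also builds the high-quality RevealLayer-100K dataset and the RevealLayerBench benchmark; experiments show the method outperforms existing approaches on layer decomposition.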

Abstract

Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image dataset constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.

2605.11815 2026-05-13 cs.LG

Fed-BAC: Federated Bandit-Guided Additive Clustering in Hierarchical Federated Learning

Satwat Bashir, Tasos Dagiuklas, Muddesar Iqbal

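Affiliations Informatics, London South Bank University, London, UK; Digital Technologies, London South Bank University, London, UK

AI summary This paper proposes Fed-BAC, a federated learning method for jointly optimizing cluster assignment and client selection in hierarchical federated learning under data heterogeneity. Fed-BAC combines additive cluster personalization with a two-level bandit framework: contextual bandits at the cloud handle cluster assignment, and edge servers use Thompson Sampling to select high-contributing clients. Experiments on several classification tasks show that Fed-BAC improves accuracy and convergence speed over existing methods and improves cross-server fairness.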

Comments 9 pages, 5 figures. Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026), Valencia, Spain, June 9-12, 2026. To appear in IEEE proceedings

Abstract

Hierarchical federated learning (HFL) leverages edge servers for partial aggregation in edge computing. Yet existing FL methods lack mechanisms for jointly optimizing cluster assignment and client selection under data heterogeneity. This paper proposes Fed-BAC, which integrates additive cluster personalization with a two-level bandit framework: contextual bandits at the cloud learn server-to-cluster assignments, while Thompson Sampling at each edge server identifies high-contributing clients. The additive decomposition enables the sharing of knowledge between groups through a globally aggregated network, while cluster-specific networks capture distribution variations. Across three classification benchmarks (CIFAR-10, SVHN, Fashion-MNIST) under moderate ($α= 0.5$) and severe ($α= 0.1$) Dirichlet non-IID partitioning, Fed-BAC achieves distributed accuracy gains of up to +35.5pp over HierFAVG and +8.4pp over IFCA, while requiring only 80% client participation, converging 1.5 to 4.8$\times$ faster depending on dataset and accuracy target, and improving cross-server fairness. These gains are further validated at 5$\times$ deployment scale on CIFAR-10. The advantage of Fed-BAC increases with heterogeneity severity, confirming that additive cluster personalization becomes increasingly valuable as data distributions diverge.
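The edge-level client selection can be pictured with a minimal Beta-Bernoulli Thompson Sampling sketch. This is an illustration under stated assumptions, not Fed-BAC's implementation: the class name `ThompsonClientSelector`, the binary reward (for example, whether a client's update improved a validation metric), and the uniform Beta(1, 1) prior are all hypothetical choices.

```python
import random


class ThompsonClientSelector:
    """Beta-Bernoulli Thompson Sampling over clients (illustrative only)."""

    def __init__(self, num_clients, seed=0):
        # Beta(1, 1) prior per client: no initial preference (assumption).
        self.alpha = [1.0] * num_clients
        self.beta = [1.0] * num_clients
        self.rng = random.Random(seed)

    def select(self, k):
        """Sample a contribution estimate per client and pick the top k."""
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.alpha, self.beta)
        ]
        ranked = sorted(range(len(samples)), key=lambda i: samples[i], reverse=True)
        return ranked[:k]

    def update(self, client, reward):
        """Record a binary reward for a client that took part in the round."""
        if reward:
            self.alpha[client] += 1.0
        else:
            self.beta[client] += 1.0
```

With 10 clients per edge server, calling `select(8)` each round would correspond to the 80% participation level mentioned in the abstract; clients that keep earning rewards are sampled with higher estimates and so are chosen more often.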

2605.11814 2026-05-13 cs.AI

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

Yihao Wang, Haoran Xu, Renjie Gu, Yixuan Ye, Xinyi Chen, Xinyu Mu, Yuan Gao, Chunxiao Guo, Peng Wei, Jinjie Gu, Huan Li, Ke Chen, Lidan Shou

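Affiliations Zhejiang University; Ant Group; Alibaba Group; Beijing University of Posts and Telecommunications

AI summary MedMemoryBench is a benchmark for evaluating the memory capabilities of personalized healthcare agents, motivated by the need in large-scale medical deployments for memory mechanisms that are precise, safe, and capable of long-term tracking. The study builds a highly realistic medical interaction dataset from clinically grounded synthetic patient archetypes and introduces a dynamic "evaluate-while-constructing" assessment protocol. It reveals serious shortcomings of mainstream models in complex medical reasoning and noise robustness, laying groundwork for reliable, practical medical agents.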

Abstract

The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.

2605.11813 2026-05-13 cs.AI

Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models

Jinbiao Chen, Shuang Jin, Guoyun Zhang, Junyu Zhang, Guanyi Wang, Hanzhang Qin

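Affiliations Department of Industrial Systems Engineering and Management, National University of Singapore; Department of Data and Systems Engineering, The University of Hong Kong; Institute of Operations Research and Analytics, National University of Singapore; Agency for Science, Technology and Research (A*STAR)

AI summary This study targets the manual reformulation required in robust optimization (RO) to turn uncertain optimization models into tractable deterministic ones. The authors propose AutoREM, an experience-memory-based automated reformulation framework that needs neither parameter updates nor domain expert knowledge and builds a structured textual memory through an offline adaptation procedure. They also construct the AutoRO-Bench benchmark for systematically evaluating LLM-based RO reformulation, and show that AutoREM improves accuracy and efficiency across multiple datasets and base models.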

Abstract

Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have shown promise for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.

2605.11809 2026-05-13 cs.AI

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

Huoren Yang, Jianchao Zhao, Hu Yusong, Qiguan Ou, Yuyang Gao, Wei Ke, Yuhang He, SongLin Dong, Zhiheng Ma, Yihong Gong

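Affiliations Xi’an Jiaotong University; One Robotics; Shenzhen University of Advanced Technology

AI summary This paper proposes MCF-Proto, a lightweight action head that improves action prediction in Vision-Language-Action (VLA) models. Rather than predicting action commands directly in a fixed world coordinate frame, it introduces a Motion-Centric Action Frame (MCF): a predicted rotation moves action prediction into a local frame, actions are parameterized with prototypes, and the result is mapped back to the world frame for end-to-end training. The method needs no additional supervision, and the learned frames develop a stable geometric structure that makes action representations more compact and more robust, especially under geometric perturbations.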

Abstract

Vision-Language-Action (VLA) models have advanced rapidly with stronger backbones, broader pre-training, and larger demonstration datasets, yet their action heads remain largely homogeneous: most directly predict action commands in a fixed world coordinate frame. We propose MCF-Proto, a lightweight action head that equips VLA policies with a Motion-Centric Action Frame (MCF) and a prototype-based action parameterization. At each step, the policy predicts a rotation $R_t \in SO(3)$, composes actions in the transformed local frame from a set of prototypes, and maps them back to the world frame for end-to-end training, using only standard demonstrations without auxiliary supervision. This simple design induces stable emergent structure. Without explicit directional labels, the learned local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion. Meanwhile, actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes. These structural properties translate into improved robustness, especially under geometric perturbations. Our results suggest that adding lightweight geometric and compositional structure to the action head can materially improve how VLA policies organize and generalize robotic manipulation behavior. An anonymized code repository is provided in the supplementary material.
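The three steps in the abstract (predict a rotation, compose a local action from prototypes, map it back to the world frame) can be sketched as follows. This is a hedged illustration, not the authors' action head: it covers only a 3D translational action, assumes the local action is a softmax-weighted sum of prototypes, and uses a fixed rotation about the z-axis in place of a network-predicted $R_t$. All function names are hypothetical.

```python
import math


def rotation_z(theta):
    """3x3 rotation about the z-axis; stands in for a predicted R_t in SO(3)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_world_action(R, prototypes, logits):
    """Mix prototypes in the local frame, then rotate into the world frame.

    R:          3x3 rotation matrix (local frame -> world frame).
    prototypes: list of 3D action prototypes expressed in the local frame.
    logits:     one score per prototype; softmax mixing is an assumption.
    """
    weights = softmax(logits)
    local = [
        sum(w * p[d] for w, p in zip(weights, prototypes))
        for d in range(3)
    ]
    # World-frame action: a_world = R @ a_local.
    return [sum(R[r][c] * local[c] for c in range(3)) for r in range(3)]
```

Because the prototypes live in the local frame, the same prototype (say, "move forward along the local x-axis") can express motion in any world direction once the rotation is applied, which is one way to read the compactness claim in the abstract.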

2605.11808 2026-05-13 cs.CV

Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

Zhenxin Qin, Qiang Li, Qingzhuo Wang, Ruiyang Qin, Zhihua Wei, Wen Shen

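Affiliations Tongji University

AI summary This paper studies action-relation hallucinations in large vision-language models (LVLMs), where generated text is inconsistent with the action relations in the visual input. The authors propose Relation-aware Visual Enhancement (RVE): an Action-Relation Sensitivity (ARS) score locates action-relevant image regions containing key visual cues, and the model's attention to those regions is strengthened. Experiments show the method mitigates action-relation hallucinations better than existing approaches at almost no extra inference cost, and that it generalizes to spatial-relation and object hallucination tasks.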

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.

2605.11807 2026-05-13 cs.AI

Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

Qiuyu Ding, Heng-Da Xu, Wei Zhang, Dongyi Lv, Changda Xia, Feng Xiong, Mu Xu

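Affiliations Amap, Alibaba Group; Xi’an Jiaotong University

AI summary This study addresses the inability of generative point-of-interest (POI) recommendation models to perceive changing real-world conditions and proposes AWARE, an enhancement based on large language models (LLMs). An LLM agent generates location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events, and these are combined with user behavioral context for personalized recommendation. Experiments on three real-world datasets show that AWARE outperforms existing methods, with up to 12.4% relative improvement.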

Abstract

Generative point-of-interest (POI) recommendation models based on large language models (LLMs) have shown promising results by formulating next POI prediction as a sequence generation task. However, the knowledge encoded in these models remains fixed after training, making them unable to perceive evolving real-world conditions that shape user mobility decisions, such as local events and cultural trends. To bridge this gap, we propose AWARE (Agent-based World knowledge Augmented REcommendation), which employs an LLM agent to generate location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events relevant to each user. Rather than introducing generic or noisy information, AWARE further anchors these narratives in each user's behavioral context, grounding external world knowledge in personalized spatial-temporal patterns. Extensive experiments on three real-world datasets demonstrate that AWARE consistently outperforms competitive baselines, achieving up to 12.4% relative improvement.

2605.11804 2026-05-13 cs.LG cs.CV

Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning

Patryk Krukowski, Jacek Tabor, Przemysław Spurek, Marek Śmieja, Łukasz Struski

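Affiliations Jagiellonian University; IDEAS Research Institute

AI summary This paper studies model inversion in data-free continual learning (DFCIL), with the goal of generating high-quality pseudo-samples to mitigate catastrophic forgetting. Existing methods typically assume diagonal covariance for feature distributions, ignoring correlations between features and yielding lower-quality samples. The authors propose REMIX, which uses a Laplace kernel parameterization to model structured covariance, capturing feature dependencies while remaining computationally efficient and improving both the fidelity of synthesized samples and DFCIL performance.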

Abstract

Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github.com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.

2605.11803 2026-05-13 cs.CV cs.AI

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

Minseok Kang, Minhyeok Lee, Jungho Lee, Minjung Kim, Donghyeong Kim, Dayeon Lee, Heeseung Choi, Ig-jae Kim, Sangyoun Lee

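Affiliations Yonsei University; LG Electronics; KIST

AI summary As video large language models (Video-LLMs) handle longer and more complex videos, inference cost rises quickly with the number of visual tokens accumulated across frames. This paper proposes OTT-Vid, a temporal token compression method based on optimal transport. Spatial pruning identifies key content in each frame, and optimal transport with non-uniform token mass estimates how compressible each pair of neighboring frames is, so that compression budgets are allocated dynamically and semantically important tokens are protected. Experiments show that OTT-Vid preserves 95.8% of video question answering performance and 73.9% of temporal grounding performance while retaining only 10% of tokens, outperforming existing training-free compression methods.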

Comments 22 pages, 9 figures. Code available at https://github.com/minseokii/OTT-Vid

Abstract

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
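To make the transport step concrete, here is a minimal entropic optimal transport sketch between the tokens of two neighboring frames. The abstract does not say which solver OTT-Vid uses, so Sinkhorn iterations are a swapped-in technique chosen for brevity; the function name `sinkhorn`, the regularization strength `eps`, and the toy cost matrices are assumptions. The masses `a` and `b` play the role of non-uniform token importance, and the returned total cost plays the role of the transport difficulty of the frame pair.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, iters=200):
    """Entropic OT between two discrete distributions (illustrative only).

    a, b:  non-negative token masses for the two frames, each summing to 1.
    cost:  cost[i][j] is the cost of matching token i (frame t) to token j
           (frame t+1), e.g. feature distance plus a spatial-distance term.
    Returns (transport plan, total transport cost).
    """
    n, m = len(a), len(b)
    # Gibbs kernel; very small eps with large costs can underflow to zero.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total
```

A frame pair whose tokens match cheaply yields a low total cost and could be compressed more aggressively, while a pair with a high total cost would keep more tokens; how OTT-Vid maps cost to budget is not specified in the abstract.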

2605.11800 2026-05-13 cs.LG cs.CL

ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

Wenyong Zhou, Yuannuo Feng, Yizhe Chen, Taiqiang Wu, Wendong Xu, Wenbo Qi, Zhengwu Liu, Wang Kang, Ngai Wong

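Affiliations: The Department of Electrical and Computer Engineering, The University of Hong Kong; The School of Integrated Circuit Science and Engineering, Beihang University

AI summary: This study addresses the robustness of mixture-of-experts (MoE) large language models on analog compute-in-memory (CIM) systems and proposes ROMER, a post-training calibration framework. It finds that hardware noise severely disrupts expert load balance and degrades routing decisions; ROMER replaces underactivated experts and recalibrates router logits, improving model performance under noise. Experiments show that ROMER substantially reduces perplexity on several MoE models, supporting its effectiveness and broad applicability.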

Comments 11 pages, 5 figures, 4 tables

Details
English abstract

Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and the negative impact of these imperfections on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under a noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6%, 58.8%, and 59.8% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
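The two steps named in the abstract above can be illustrated with a small plain-Python sketch. Everything concrete here is an assumption for illustration: the percentile levels (10 and 90), the `min_share` threshold, the round-robin donor assignment, and the function names are hypothetical and are not taken from the paper.

```python
def percentile(values, q):
    """Linearly interpolated percentile of a list, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def normalize_logits(logits, low_q=10.0, high_q=90.0):
    """Rescale router logits by the spread between two percentiles.

    The result is unchanged if noise applies a common gain and offset
    to all logits, which is the property this sketch is meant to show.
    """
    low, high = percentile(logits, low_q), percentile(logits, high_q)
    scale = (high - low) or 1.0  # avoid division by zero for constant logits
    return [(x - low) / scale for x in logits]

def replacement_map(activation_counts, min_share=0.05):
    """Map each underactivated expert to a frequently activated donor expert."""
    total = sum(activation_counts)
    ranked = sorted(range(len(activation_counts)), key=lambda i: -activation_counts[i])
    under = [i for i in ranked if activation_counts[i] / total < min_share]
    donors = [i for i in ranked if i not in under]
    if not donors:
        return {}
    return {expert: donors[k % len(donors)] for k, expert in enumerate(under)}

clean = [0.0, 1.0, 2.0, 3.0, 4.0]
drifted = [10.0 * x + 7.0 for x in clean]   # same logits after a gain/offset drift
mapping = replacement_map([50, 40, 9, 1])   # expert 3 is rarely activated
```

Because the normalization divides by a percentile spread, `normalize_logits(clean)` and `normalize_logits(drifted)` agree, so the top-k expert choice is unaffected by that kind of drift.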

2605.11799 2026-05-13 cs.CV

SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions

Markus Essl, Marta Moscati, Mubashir Noman, Muhammad Zaigham Zaheer, Usman Naseem, Shah Nawaz, Markus Schedl

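Affiliations: Johannes Kepler University Linz; MBZUAI; Macquarie University; Linz Institute of Technology

AI summary: This paper proposes a method to make multimodal sensor fusion for 3D object detection in autonomous vehicles more robust. For cases where camera or LiDAR data are missing or corrupted, it designs a framework-agnostic fusion module that handles the failure or corruption of a single modality, and instantiates it in the BEVFusion framework. Experiments show strong results across a range of sensor degradation scenarios, reaching state-of-the-art performance under extreme weather and sensor failure.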

Comments Accepted at ICIP 2026

Details
English abstract

Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird's-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.

2605.11789 2026-05-13 cs.AI

Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations

Alison Moldovan-Mauer, Benedikt Mangold

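Affiliations: Technische Hochschule Nürnberg

AI summary: This study examines how uncivil communication affects collaborative efficiency in multi-agent systems, building an LLM-based multi-agent system and running large-scale Monte Carlo simulations. It finds that incivility significantly increases the time agents need to reach a conclusion, and that the delay is more pronounced for models with fewer parameters. It also identifies a first-mover advantage: the agent that speaks first is more likely to win the debate under every toxicity condition.

Details
English abstract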

Unconstructive debate and uncivil communication carry well-documented costs for productivity and cohesion, yet isolating their effect on operational efficiency has proven difficult. Human subject research in this domain is constrained by ethical oversight, limited reproducibility, and the inherent unpredictability of naturalistic settings. We address this gap by leveraging Large Language Model (LLM)-based Multi-Agent Systems as a controlled sociological sandbox, enabling systematic manipulation of communicative behavior at scale. Using a Monte Carlo simulation framework, we generate thousands of structured 1-on-1 adversarial debates across varying toxicity conditions, measuring convergence time, defined as the number of rounds required to reach a conclusion, as a proxy for interactional efficiency. Building on a prior study, we replicate and extend its findings across two additional LLM agents of varying parameter size, allowing us to assess whether the effects of toxic behavior on debate dynamics generalize across model scale. We confirm the convergence latency of 25% reported in the previous study and find that this latency is significantly larger for models with fewer parameters. We further identify a significant first-mover advantage, whereby the agent initiating the discussion wins significantly above chance regardless of toxicity condition.

2605.11782 2026-05-13 cs.CV

Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

Antoni Valls, Jordi Sanchez-Riera

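Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC

AI summary: To help people with visual impairment navigate cities safely and independently, this study proposes an event map framework based on visual question answering (VQA) that uses vision-language models (VLMs) to describe pedestrian scenes and identify hazards. A three-level hierarchical query structure enables fine-grained scene understanding without task-specific retraining, and model responses are aggregated into a weighted risk score that yields navigable risk maps with four safety categories. The study also builds a diverse dataset covering 20 cities on six continents and shows that generative multimodal large language models perform best on this task.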

Comments 10 pages, 6 figures, submitted to IEEE T-ITS

Details
English abstract

Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
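The aggregation step in the abstract above (VQA answers, then a weighted risk score, then one of four safety categories) might look roughly like the following Python sketch. The hazard names, weights, thresholds, and category labels are all hypothetical placeholders; the abstract does not give the actual questions or scoring scheme.

```python
# Hypothetical hazard weights: larger means more dangerous for a pedestrian.
HAZARD_WEIGHTS = {
    "construction": 3.0,
    "vehicle_on_sidewalk": 2.5,
    "stairs": 1.5,
    "crowd": 1.0,
}
# Hypothetical upper bounds for the first three categories; anything above is "dangerous".
THRESHOLDS = [(0.25, "safe"), (0.5, "caution"), (0.75, "risky")]

def risk_score(answers):
    """Normalised weighted score in [0, 1] from yes/no VQA answers for one image."""
    total = sum(HAZARD_WEIGHTS.values())
    hit = sum(w for hazard, w in HAZARD_WEIGHTS.items() if answers.get(hazard, False))
    return hit / total

def segment_score(image_answers):
    """Average the per-image scores of all images taken on one street segment."""
    return sum(risk_score(a) for a in image_answers) / len(image_answers)

def safety_category(score):
    """Map a score to one of four discrete safety categories."""
    for upper, label in THRESHOLDS:
        if score < upper:
            return label
    return "dangerous"

segment = [{"stairs": True}, {"construction": True, "crowd": True}]
category = safety_category(segment_score(segment))
```

A route planner could then treat the category of each segment as an edge cost, which is one way to read "navigable risk-aware event maps".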

2605.11779 2026-05-13 cs.CL

Choosing features for classifying multiword expressions

Eric Laporte

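Affiliations: Université Paris-Est, Laboratoire d’informatique Gaspard-Monge CNRS UMR 8049

AI summary: This paper studies how to choose suitable features for classifying multiword expressions (MWEs). The author notes that although many features are available, not all of them support reliable classification, so the choice of features affects how useful the resulting classification is. The author outlines an enhanced classification and draws on prior work covering various languages to make it suitable for many languages.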

Journal ref Multiword expressions: Insights from a multi-lingual perspective, 2018, Language Science Press, pp.143-186

Details
English abstract

Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.

2605.11774 2026-05-13 cs.CL cs.LG

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

Mingcheng Zhu, Zhiyao Luo, Yu Liu, Tingting Zhu

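Affiliations: Department of Engineering Science, University of Oxford, Oxford, United Kingdom

AI summary: To handle the long sequences that electronic health records (EHRs) produce in clinical prediction tasks, this study proposes MedTPE, an efficient lossless compression method. It merges frequently co-occurring medical token pairs into composite tokens, compressing the original sequence while preserving computational complexity and model performance. Experiments show that MedTPE reduces input length and inference latency across several clinical prediction tasks and remains robust and generalizable across models and languages.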

Comments 21 pages, 6 figures, 13 tables

Details
English abstract

By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduces additional inference latency or risks losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens, merely 0.5-1.0% of the LLM's parameters, are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
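To make the token-pair merging idea in the abstract above concrete, here is a minimal Python sketch. It assumes a simple greedy left-to-right replacement and represents a composite token as a tuple; the paper's dependency-aware replacement strategy is not described in the abstract and is not reproduced here, and the function names are illustrative only.

```python
from collections import Counter

def frequent_pairs(sequences, top_k):
    """Return the top_k most frequent adjacent token pairs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return [pair for pair, _ in counts.most_common(top_k)]

def compress(seq, pairs):
    """Greedily replace each listed adjacent pair with one composite token (a tuple)."""
    merged, out, i = set(pairs), [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) in merged:
            out.append((seq[i], seq[i + 1]))
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def decompress(seq):
    """Expand composite tokens back into their two parts; the round trip is exact."""
    out = []
    for tok in seq:
        if isinstance(tok, tuple):
            out.extend(tok)
        else:
            out.append(tok)
    return out

records = [
    ["heart", "rate", "98", "heart", "rate", "102"],
    ["heart", "rate", "95"],
]
pairs = frequent_pairs(records, top_k=1)
short = compress(records[0], pairs)
```

The compression is lossless in the sense that `decompress(compress(seq, pairs))` returns the original sequence, while the compressed sequence is shorter whenever a merged pair occurs.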

2605.11773 2026-05-13 cs.LG cs.AI

Is Monotonic Sampling Necessary in Diffusion Models?

Muhammad Haris Khan

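Affiliations: Department of Computer Science, University of Copenhagen, Denmark

AI summary: This paper asks whether diffusion models must use monotonic sampling. It designs four families of nonmonotonic noise schedules and runs extensive experiments on several generative models; no nonmonotonic schedule outperforms the monotonic baseline. It further shows that models differ in their sensitivity to the schedule and proposes a new diagnostic of diffusion model quality, the Schedule Sensitivity Coefficient.

Details
English abstract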

Diffusion models generate samples by iteratively denoising a Gaussian prior, traversing a sequence of noise levels that, in every published sampler, decreases monotonically. Six years of intensive work has refined nearly every aspect of this recipe, including the corruption operator, the training objective, the schedule shape, the architecture, and the ODE solver. Yet the assumption of monotonicity itself has never been systematically tested. Here we ask whether monotonic sampling is load-bearing or merely conventional. We design four families of structured nonmonotonic schedules and apply them to three architecturally distinct generative models, DDPM, EDM, and Flow Matching, across NFE budgets ranging from 10 to 200 function evaluations, plus a 42-cell hyperparameter ablation, on CIFAR-10. Across all 90 tested configurations, no tested nonmonotonic schedule improves on the monotonic baseline. The magnitude of the penalty, however, spans nearly three orders of magnitude: persistent and substantial in DDPM, intermediate in Flow Matching, and indistinguishable from zero in EDM. We show that this variation is not noise but a structural property of each trained denoiser, and we formalize it as the Schedule Sensitivity Coefficient, a cheap, architecture-agnostic diagnostic that provides evidence of non-convergence to the Bayes-optimal denoiser at the critical noise level. Our findings justify the field's tacit reliance on monotonic schedules and supply a new probe of diffusion model quality complementary to sample-quality metrics such as Fréchet Inception Distance.

2605.11771 2026-05-13 cs.CV

Revisiting Shadow Detection from a Vision-Language Perspective

Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

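Affiliations: CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Department of Electronic Engineering and Information Science, University of Science and Technology of China

AI summary: This paper revisits shadow detection from a vision-language perspective. It argues that methods relying on visual cues alone can fail in visually ambiguous scenes and proposes SVL, a framework that uses language as an explicit semantic reference to distinguish shadows from similar dark regions. SVL aligns image and text embeddings through scene-level shadow ratio regression and adds a global-to-local coupling mechanism that keeps image-level and fine-grained predictions consistent, while remaining parameter-efficient. Experiments on multiple benchmarks show strong performance and robustness.

Details
English abstract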

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.

2605.11769 2026-05-13 cs.CL

Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

Yujing Chang, Yash Guleria, Duc-Thinh Pham, Nhut-Huy Pham, Ningli Wang, Vu N. Duong, Sameer Alam

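Affiliations: ATMRI, Nanyang Technological University (NTU), Singapore; School of Management, Indian Institute of Technology Mandi, India; Centre of AI Research, VinUniversity, Vietnam

AI summary: This paper studies the reliability of large language models in air traffic control (ATC), a safety-critical domain, and proposes a safety-oriented, consequence-aware evaluation framework to address the way existing methods handle high-risk semantic errors. It finds that although current language models achieve good aggregate accuracy, their reliability drops markedly on critical information such as runway identifiers or movement constraints, indicating structural grounding deficiencies for real ATC use. The study provides an evaluation basis for the responsible deployment of AI-assisted ATC systems.

Details
English abstract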

Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
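The contrast the abstract above draws between uniform metrics and consequence-aware evaluation can be shown with a toy Python example. This is not the paper's Risk Score, whose definition the abstract does not give; the severity weights, the slot names, and `severity_weighted_score` are hypothetical and only illustrate why two errors with the same plain accuracy can differ in operational impact.

```python
# Hypothetical severity weights per instruction slot (higher = worse if wrong).
SEVERITY = {"callsign": 3.0, "runway": 5.0, "altitude": 4.0, "action": 1.0}

def plain_accuracy(pred, gold):
    """Fraction of slots predicted correctly; every error counts the same."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)

def severity_weighted_score(pred, gold, severity=SEVERITY):
    """Share of total severity weight carried by the correctly predicted slots."""
    total = sum(severity[k] for k in gold)
    correct = sum(severity[k] for k, v in gold.items() if pred.get(k) == v)
    return correct / total

gold = {"callsign": "SIA123", "runway": "02L", "altitude": "FL120", "action": "descend"}
wrong_runway = dict(gold, runway="02R")   # high-consequence error
wrong_action = dict(gold, action="climb") # lower-weight error under these weights

acc_runway = plain_accuracy(wrong_runway, gold)
acc_action = plain_accuracy(wrong_action, gold)
score_runway = severity_weighted_score(wrong_runway, gold)
score_action = severity_weighted_score(wrong_action, gold)
```

Both predictions have one wrong slot out of four, so plain accuracy cannot tell them apart, while the weighted score penalises the runway error more heavily.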

2605.11764 2026-05-13 cs.LG q-bio.BM

Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

Thor Klamt, Wolfgang Nejdl, Ming Tang

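Affiliations: L3S Research Center, Leibniz Universität Hannover; Institute of Data Science (Knowledge-Based Systems), Faculty of Electrical Engineering and Computer Science, Leibniz Universität Hannover

AI summary: This study examines the generalization gap in machine-learning prediction of PROTAC (proteolysis-targeting chimera) activity and identifies inter-laboratory measurement variance as the main contributor. By analyzing several models under different evaluation protocols, it shows how strongly cross-laboratory differences affect predictive performance and proposes a framework for decomposing the gap. It also releases the PROTAC-Bench dataset and associated evaluation tools.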

Comments 32 pages, 11 figures, 11 tables. Dataset: https://huggingface.co/datasets/ThorKl/protac-bench (CC-BY-4.0). Code: https://github.com/ThorKlm/PROTAC-Bench (MIT)

Details
English abstract

Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
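The leave-one-target-out (LOTO) protocol contrasted with random splits in the abstract above can be sketched in a few lines of Python. The record layout (a dict with a `target` key) and the helper names are assumptions for illustration and do not reflect the PROTAC-Bench code; `auroc` is the standard pairwise-ranking definition.

```python
def loto_folds(records):
    """Yield (held_out_target, train, test), holding out every record of one target.

    Unlike a random split, no measurement of the test target is seen in training,
    so each fold measures prediction on a genuinely novel target.
    """
    targets = sorted({r["target"] for r in records})
    for held_out in targets:
        train = [r for r in records if r["target"] != held_out]
        test = [r for r in records if r["target"] == held_out]
        yield held_out, train, test

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5).

    Requires at least one positive and one negative label in the fold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy records; target names and labels are made up for the example.
records = [
    {"target": "BRD4", "active": 1},
    {"target": "BRD4", "active": 0},
    {"target": "BTK", "active": 1},
    {"target": "AR", "active": 0},
]
folds = list(loto_folds(records))
```

Averaging `auroc` over such folds gives a LOTO estimate; averaging it over random folds of the same records would instead reward within-target interpolation.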

2605.11762 2026-05-13 cs.RO

NavOL: Navigation Policy with Online Imitation Learning

Xiaofei Wei, Chun Gu, Li Zhang

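Affiliations: Shanghai Innovation Institute; School of Data Science, Fudan University

AI summary: This paper proposes NavOL, an online imitation learning framework for learning robust robot navigation policies. NavOL interacts with a simulator, collects expert demonstrations online, and updates the policy, avoiding the distribution shift and compounding errors of offline imitation learning as well as the reward engineering required by reinforcement learning. Built on a pretrained navigation diffusion policy and trained online with a global path planner, the method improves learning efficiency and generalization and is validated in simulation and real-world settings.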

Comments Project page: https://logosroboticsgroup.github.io/NavOL/

Details
English abstract

Learning robust navigation policies remains a core challenge in robotics. Offline imitation learning suffers from distribution shift and compounding errors at rollout, while reinforcement learning requires reward engineering and learns inefficiently. In this paper, we propose NavOL, an online imitation learning paradigm that interacts with a simulator and updates itself using expert demonstrations gathered online. Built upon a pretrained navigation diffusion policy that maps local observations to future waypoints, NavOL trains in a rollout-update loop: during rollout, the policy acts in the simulator and queries a global planner, which has privileged access to the global environment, for the optimal path segment as ground-truth trajectory labels; during update, the policy is trained on the observation-trajectory pairs collected online. This online imitation loop removes the need for reward design, improves learning efficiency, and mitigates distribution shift by training on the policy's own explored rollouts. Built on IsaacLab with fast, high-fidelity parallel rendering and domain randomization of camera pose and start-goal pairs, our system scales across 50 scenes on 8 RTX 4090 GPUs, collecting over 2,000 new trajectories per hour, each averaging more than 400 steps. We also introduce an indoor visual navigation benchmark with predefined start and goal positions for zero-shot generalization. Extensive evaluations on simulation benchmarks, including the NavDP benchmark and our proposed benchmark, as well as carefully designed real-world experiments, demonstrate the effectiveness of NavOL, showing consistent performance gains in online imitation learning.
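The rollout and update phases described in the abstract above follow the general shape of online imitation learning, which the toy Python sketch below illustrates on a one-dimensional corridor. It is a deliberately simplified stand-in under stated assumptions: the tabular `policy`, the `expert_action` planner, and all parameters are hypothetical, and nothing here reflects NavOL's diffusion policy or simulator.

```python
import random

def expert_action(pos, goal):
    """Privileged planner: it knows the goal and returns the optimal step."""
    return 1 if goal > pos else -1 if goal < pos else 0

def online_imitation(goal=5, rounds=20, horizon=15, seed=0):
    """Alternate rollouts of the current policy with updates on expert labels."""
    rng = random.Random(seed)
    policy = {}  # tabular stand-in for a learned policy: position -> action
    for _ in range(rounds):
        # Rollout: the current policy acts, and the expert labels each visited state.
        dataset = []
        pos = rng.randint(-10, 10)
        for _ in range(horizon):
            if pos in policy:
                action = policy[pos]
            else:
                action = rng.choice([-1, 1])  # untrained states act randomly
            dataset.append((pos, expert_action(pos, goal)))
            pos += action
        # Update: fit the policy to the labels collected on its own rollouts.
        for state, label in dataset:
            policy[state] = label
    return policy

policy = online_imitation()
```

The key property is that training labels come from states the policy itself visited, which is how this family of methods addresses the distribution shift of purely offline imitation.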

2605.11760 2026-05-13 cs.CV

M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

Jiyuan Liu, Jia Lin, Xiaofei Zhou, Runmin Cong, Deyang Liu, Zhi Liu

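Affiliations: Hangzhou Dianzi University; Shandong University; Anqing Normal University; Shanghai University

AI summary: This paper proposes M$^4$-SAM, a multi-modal mixture-of-experts model for RGB-D video salient object detection. With a modality-aware LoRA mechanism, a multi-level feature fusion module, and a pseudo-guided initialization that needs no manual prompts, M$^4$-SAM addresses SAM2's limitations in spatial modeling, multi-scale feature use, and dependence on prompts for initialization. Experiments show state-of-the-art detection performance on three public datasets.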

Comments 10 pages, 3 figures

Details
English abstract

The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges: limited spatial modeling of linear LoRA, insufficient use of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. First, we inject Modality-Aware MoE-LoRA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Second, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.

2605.11756 2026-05-13 cs.CV cs.AI

Focusable Monocular Depth Estimation

Yuxin Du, Tao Lin, Zile Zhong, Runting Li, Xiyao Chen, Jiting Liu, Chenglin Liu, Ying-Cong Chen, Yuqian Fu, Bo Zhao

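Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; The Hong Kong University of Science and Technology (Guangzhou); King Abdullah University of Science and Technology

AI summary: This paper proposes Focusable Monocular Depth Estimation (FDE), which aims to improve depth accuracy in user-specified or task-relevant regions. It introduces FocusDepth, a prompt-conditioned framework whose Multi-Scale Spatial-Aligned Fusion (MSSA) aligns and fuses multi-scale features with target-region prompts, strengthening depth perception in the target region while preserving global scene geometry. The study also builds the FDE-Bench benchmark, on which the method clearly outperforms existing baselines at target boundaries and in foreground regions.

Details
English abstract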

Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.