arXivDaily arXiv daily academic digest, updated Monday to Friday
All subject categories 2332
2605.11774 2026-05-13 cs.CL cs.LG

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

Mingcheng Zhu, Zhiyao Luo, Yu Liu, Tingting Zhu

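AI summary This study addresses the long sequences that electronic health records (EHRs) produce in clinical prediction tasks and proposes MedTPE, an efficient lossless compression method. The method merges frequently co-occurring medical token pairs into composite tokens, compressing the original sequence while preserving computational complexity and model performance. Experiments show that MedTPE reduces input length and inference latency across multiple clinical prediction tasks and remains robust and generalisable across different models and languages.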

Comments 21 pages, 6 figures, 13 tables

Abstract

By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens, which account for merely 0.5-1.0% of the LLM's parameters, are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
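
The pair-merging idea in this abstract can be illustrated with a small sketch. It is a generic greedy, BPE-style merge over already-tokenised sequences, not the paper's dependency-aware replacement strategy, which the abstract does not specify. All function names, the `<a+b>` naming of composite tokens, and the `min_count` cut-off are assumptions made for illustration. The sketch also assumes that no composite name already occurs in the input, which is what makes expansion exact.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent token pairs across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, composite):
    """Replace non-overlapping occurrences of `pair`, left to right, with `composite`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(composite)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, num_merges, min_count=2):
    """Greedily pick the most frequent pair, merge it, and repeat (hypothetical recipe)."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(seqs)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_count:
            break
        composite = f"<{pair[0]}+{pair[1]}>"  # hypothetical naming scheme
        merges.append((pair, composite))
        seqs = [merge_pair(s, pair, composite) for s in seqs]
    return merges


def compress(seq, merges):
    """Apply the learned merges in the order they were learned."""
    seq = list(seq)
    for pair, composite in merges:
        seq = merge_pair(seq, pair, composite)
    return seq


def decompress(seq, merges):
    """Undo the merges in reverse order; exact because composite names are fresh."""
    seq = list(seq)
    for pair, composite in reversed(merges):
        expanded = []
        for tok in seq:
            if tok == composite:
                expanded.extend(pair)
            else:
                expanded.append(tok)
        seq = expanded
    return seq
```

Because every composite token expands back to exactly one pair, `decompress(compress(seq, merges), merges)` returns the original sequence. That round trip is the sense in which this kind of compression is lossless.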

2605.11773 2026-05-13 cs.LG cs.AI

Is Monotonic Sampling Necessary in Diffusion Models?

Muhammad Haris Khan

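AI summary This paper examines whether diffusion models must use a monotonic sampling strategy. It designs four families of nonmonotonic noise schedules and runs extensive experiments on several generative models; none of the nonmonotonic schedules outperforms the monotonic baseline. The study further shows that models differ in their sensitivity to the schedule and proposes a new metric for assessing diffusion model quality, the Schedule Sensitivity Coefficient.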

Abstract

Diffusion models generate samples by iteratively denoising a Gaussian prior, traversing a sequence of noise levels that, in every published sampler, decreases monotonically. Six years of intensive work has refined nearly every aspect of this recipe, including the corruption operator, the training objective, the schedule shape, the architecture, and the ODE solver. Yet the assumption of monotonicity itself has never been systematically tested. Here we ask whether monotonic sampling is load-bearing or merely conventional. We design four families of structured nonmonotonic schedules and apply them to three architecturally distinct generative models, DDPM, EDM, and Flow Matching, across NFE budgets ranging from 10 to 200 function evaluations, plus a 42-cell hyperparameter ablation, on CIFAR-10. Across all 90 tested configurations, no tested nonmonotonic schedule improves on the monotonic baseline. The magnitude of the penalty, however, spans nearly three orders of magnitude: persistent and substantial in DDPM, intermediate in Flow Matching, and indistinguishable from zero in EDM. We show that this variation is not noise but a structural property of each trained denoiser, and we formalize it as the Schedule Sensitivity Coefficient, a cheap, architecture-agnostic diagnostic that provides evidence of non-convergence to the Bayes-optimal denoiser at the critical noise level. Our findings justify the field's tacit reliance on monotonic schedules and supply a new probe of diffusion model quality complementary to sample-quality metrics such as Frechet Inception Distance.

2605.11771 2026-05-13 cs.CV

Revisiting Shadow Detection from a Vision-Language Perspective

Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

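AI summary This paper revisits shadow detection from a vision-language perspective. It notes that conventional methods based on visual cues can fail in visually ambiguous scenes and therefore proposes the SVL framework, which uses language as an explicit semantic reference to distinguish shadows from similar dark regions. SVL aligns image and text embeddings through scene-level shadow ratio regression and introduces a global-to-local coupling mechanism that keeps holistic and fine-grained predictions consistent while remaining parameter-efficient. Experiments show strong performance and robustness on multiple benchmarks.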

Abstract

Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision-language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision-Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than 1% trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.

2605.11769 2026-05-13 cs.CL

Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

Yujing Chang, Yash Guleria, Duc-Thinh Pham, Nhut-Huy Pham, Ningli Wang, Vu N. Duong, Sameer Alam

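AI summary This paper studies the reliability of large language models in air traffic control (ATC), a safety-critical domain, and proposes a safety-oriented, consequence-aware evaluation framework to address the shortcomings of existing methods in handling high-risk semantic errors. It finds that although current language models achieve good overall accuracy, their reliability drops markedly on critical information such as runway identifiers or operational constraints, which indicates structural comprehension deficiencies in practical ATC applications. The study provides an evaluation basis for the responsible deployment of AI-assisted ATC systems.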

Abstract

Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.

2605.11764 2026-05-13 cs.LG q-bio.BM

Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling

Thor Klamt, Wolfgang Nejdl, Ming Tang

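AI summary This study examines the generalisation gap in machine-learning prediction of PROTAC (proteolysis-targeting chimera) bioactivity and identifies inter-laboratory measurement variance as the main driver of that gap. By analysing several models under different evaluation protocols, it shows that cross-laboratory data differences significantly affect predictive performance and proposes a framework for decomposing the gap. It also releases the PROTAC-Bench dataset and associated evaluation tools as a resource for follow-up research.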

Comments 32 pages, 11 figures, 11 tables. Dataset: https://huggingface.co/datasets/ThorKl/protac-bench (CC-BY-4.0). Code: https://github.com/ThorKlm/PROTAC-Bench (MIT)

Abstract

Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
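
The contrast between random splits and leave-one-target-out (LOTO) evaluation can be made concrete with a short sketch of a LOTO fold generator. This is not the PROTAC-Bench evaluation code. The record layout, the `target` key, and the `min_test_size` filter are assumptions for illustration. The abstract reports 65 LOTO folds over 173 targets but does not state how the folds were chosen, so the filter is only one hypothetical way a subset of targets could end up with folds.

```python
from collections import defaultdict


def leave_one_target_out(records, target_key="target", min_test_size=1):
    """Yield (held_out_target, train_indices, test_indices) for each eligible target.

    Every measurement of the held-out target goes into the test fold, so the
    model is never trained on the target it is evaluated on.
    """
    by_target = defaultdict(list)
    for idx, rec in enumerate(records):
        by_target[rec[target_key]].append(idx)

    for held_out in sorted(by_target):
        test_idx = by_target[held_out]
        if len(test_idx) < min_test_size:  # hypothetical eligibility rule
            continue
        train_idx = sorted(
            i for t, idxs in by_target.items() if t != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx
```

A random split would instead scatter measurements of the same target across the training and test sets, which is why it rewards within-target interpolation rather than prediction on novel targets.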

2605.11762 2026-05-13 cs.RO

NavOL: Navigation Policy with Online Imitation Learning

Xiaofei Wei, Chun Gu, Li Zhang

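AI summary This paper proposes NavOL, an online imitation learning framework for learning robust robot navigation policies. NavOL interacts with a simulator, collects expert demonstrations online, and updates the policy, which avoids the distribution shift and compounding errors of conventional offline imitation learning and dispenses with the complex reward design of reinforcement learning. Built on a pretrained navigation diffusion policy and trained online with a global path planner, the method markedly improves learning efficiency and generalisation, and its effectiveness is validated in multiple simulated and real-world scenarios.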

Comments Project page: https://logosroboticsgroup.github.io/NavOL/

Abstract

Learning robust navigation policies remains a core challenge in robotics. Offline imitation learning suffers from distribution shift and compounding errors at rollout, while reinforcement learning requires reward engineering and learns inefficiently. In this paper, we propose NavOL, an online imitation learning paradigm that interacts with a simulator and updates itself using expert demonstrations gathered online. Built upon a pretrained navigation diffusion policy that maps local observations to future waypoints, NavOL trains in a rollout-update loop: during rollout, the policy acts in the simulator and queries a global planner, which has privileged access to the global environment, for the optimal path segment to use as ground-truth trajectory labels; during update, the policy is trained on the observation-trajectory pairs collected online. This online imitation loop removes the need for reward design, improves learning efficiency, and mitigates distribution shift by training on the policy's own explored rollouts. Built on IsaacLab with fast, high-fidelity parallel rendering and domain randomization of camera pose and start-goal pairs, our system scales across 50 scenes on 8 RTX 4090 GPUs, collecting over 2,000 new trajectories per hour, each averaging more than 400 steps. We also introduce an indoor visual navigation benchmark with predefined start and goal positions for zero-shot generalization. Extensive evaluations on simulation benchmarks, including the NavDP benchmark and our proposed benchmark, as well as carefully designed real-world experiments, demonstrate the effectiveness of NavOL, showing consistent performance gains in online imitation learning.
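
The rollout and update phases described in this abstract can be sketched on a toy problem. The sketch below is not NavOL: it replaces the diffusion policy with a lookup table, the simulator with a one-dimensional corridor, and the global planner with a rule that knows the goal. All names and the default "step left" behaviour for unseen states are assumptions. What it keeps is the structure of the loop: the learner's own actions decide which states are visited, and the privileged expert labels those states.

```python
def expert_action(pos, goal):
    """Privileged planner stand-in: knows the goal and returns the optimal step."""
    if pos == goal:
        return 0
    return 1 if pos < goal else -1


def rollout(policy, start, goal, size, max_steps):
    """Act with the learner's policy, labelling every visited state with the expert."""
    pos, pairs = start, []
    for _ in range(max_steps):
        pairs.append((pos, expert_action(pos, goal)))
        if pos == goal:
            break
        action = policy.get(pos, -1)  # unseen states default to a poor guess
        pos = min(size - 1, max(0, pos + action))
    return pairs, pos


def update(policy, pairs):
    """Supervised update: a table write stands in for a gradient step."""
    for observation, label in pairs:
        policy[observation] = label


def train(start, goal, size=10, rounds=20, max_steps=20):
    """Alternate rollout and update for a fixed number of rounds."""
    policy = {}
    for _ in range(rounds):
        pairs, _ = rollout(policy, start, goal, size, max_steps)
        update(policy, pairs)
    return policy
```

In this toy, the untrained learner first drifts away from the goal, so states such as position 0 receive expert labels even though an optimal demonstration would never visit them. That is the mechanism by which training on the policy's own rollouts mitigates distribution shift.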

2605.11760 2026-05-13 cs.CV

M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

Jiyuan Liu, Jia Lin, Xiaofei Zhou, Runmin Cong, Deyang Liu, Zhi Liu

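AI summary This paper proposes M$^4$-SAM, a multi-modal mixture-of-experts model that aims to improve RGB-D video salient object detection. By introducing a modality-aware LoRA mechanism, a multi-level feature fusion module, and a pseudo-guided initialization that requires no manual prompts, M$^4$-SAM addresses SAM2's limitations in spatial modeling, multi-scale feature use, and dependence on initialization. Experiments show that it achieves state-of-the-art detection performance on three public datasets.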

Comments 10 pages, 3 figures

Abstract

The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.

2605.11756 2026-05-13 cs.CV cs.AI

Focusable Monocular Depth Estimation

Yuxin Du, Tao Lin, Zile Zhong, Runting Li, Xiyao Chen, Jiting Liu, Chenglin Liu, Ying-Cong Chen, Yuqian Fu, Bo Zhao

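AI summary This paper proposes Focusable Monocular Depth Estimation (FDE), which aims to improve depth accuracy in user-specified or task-relevant regions. It introduces the prompt-based FocusDepth framework, whose Multi-Scale Spatial-Aligned Fusion (MSSA) aligns and fuses multi-scale features with target-region prompts, strengthening depth perception in the target region while preserving global scene geometry. The study also builds the FDE-Bench benchmark; experiments show that the method clearly outperforms existing baselines on depth estimation at target boundaries and in foreground regions.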

Abstract

Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.

2605.11753 2026-05-13 cs.AI

Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

Abid Ali, Diego Molla-Aliod, Usman Naseem

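AI summary This paper studies multimodal summarization, which aims to generate semantically coherent and accurate summaries from text and images. To address the mismatch between visual features and language model representations in existing methods, the authors propose SPeCTrA-Sum, a unified framework that deeply aligns the visual and language encoders and adds a visual relevance prediction module to select representative images. Experiments show that the method produces more visually grounded summaries and selects more representative images.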

Comments Accepted to Findings of ACL 2026

Abstract

Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Processes (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.

2605.11752 2026-05-13 cs.LG

Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention

Qijun Hou, Yuchen Shi, Pingyi Fan, Khaled B. Letaief

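AI summary This paper studies client selection in federated learning under partial visibility, where the server can observe only a subset of clients in each communication round. The authors model the problem as a partially observable Markov decision process (POMDP) and propose a reinforcement learning framework based on spatio-temporal attention that integrates historical global models and client identity embeddings to capture the temporal context of training and the persistent characteristics of clients. Experiments show that the method outperforms existing baselines in heterogeneous, partially visible settings, validating its effectiveness against incomplete observations in practical federated learning systems.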

Abstract

Federated learning relies on effective client selection to alleviate the performance degradation caused by data heterogeneity. Most existing methods assume full visibility of all clients at each communication round. However, in large-scale or edge-based deployments, communication, mobility, or availability constraints leave the server with partial visibility: only a subset of clients is observable for aggregation in each communication round. In this paper, we formulate federated client selection under partial visibility as a Partially Observable Markov Decision Process (POMDP) and propose a spatio-temporal attention-based reinforcement learning framework. By integrating historical global models and client identity embeddings, the proposed method captures both the temporal contexts of training and the persistent characteristics of clients. Experimental results across multiple datasets demonstrate that our approach achieves superior performance compared to existing baselines in heterogeneous and partially visible settings, validating its effectiveness in addressing the challenges of incomplete observations in practical federated learning systems.

2605.11750 2026-05-13 cs.RO cs.AI cs.CL cs.CV

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, Hengshuang Zhao

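AI summary Vision-Language-Action (VLA) models are prone to irrecoverable failures in fine-grained manipulation tasks because of minor action errors during critical phases. To address this, the paper proposes DreamAvoid, a framework that anticipates and avoids failures at test time through "dreamed" simulation. The method introduces a dream trigger, an action proposer, and a dream evaluator; it simulates the short-horizon outcomes of candidate actions and selects the best one to raise the task success rate. Experiments show that DreamAvoid effectively reduces failures and improves completion rates on real manipulation tasks.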

Comments 19 pages, 7 figures

Abstract

Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
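
The three-step procedure in this abstract (trigger, propose, evaluate and select) can be sketched as a control-flow skeleton. This is not the DreamAvoid implementation, which lives in the repository linked above. Every name here is hypothetical, and the trigger, proposer, and evaluator are passed in as plain callables so the sketch stays independent of any particular model.

```python
def select_action(observation, base_policy, is_critical, propose, evaluate,
                  num_candidates=4):
    """Pick an action, spending extra test-time compute only in critical phases.

    base_policy(obs)      -> action used outside critical phases
    is_critical(obs)      -> bool, stand-in for the Dream Trigger
    propose(obs, n)       -> list of candidate actions, stand-in for the Action Proposer
    evaluate(obs, action) -> scalar value of the imagined short-horizon future,
                             stand-in for the Dream Evaluator
    """
    if not is_critical(observation):
        return base_policy(observation)
    candidates = propose(observation, num_candidates)
    # Keep the candidate whose imagined outcome scores highest.
    return max(candidates, key=lambda action: evaluate(observation, action))


# Toy stand-ins: the observation is a gripper-to-object distance in millimetres.
def toy_policy(distance):
    return -10  # always approach by 10 mm


def toy_trigger(distance):
    return distance <= 20  # close to contact counts as a critical phase


def toy_proposer(distance, n):
    return [-distance - 5, -distance, -distance + 5, -10][:n]


def toy_evaluator(distance, action):
    return -abs(distance + action)  # best when the move lands exactly on the object
```

Far from the object, `select_action` simply returns the base policy's action. Near contact, it scores each candidate with the evaluator and keeps the best one, so the extra cost is paid only where a small error is most likely to be irrecoverable.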

2605.11749 2026-05-13 cs.LG

Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection

Yingjie Zhou, Yuqin Xie, Fanxing Liu, Dongjin Song, Ce Zhu, Lingqiao Liu

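AI summary This paper studies weakly supervised graph anomaly detection, which aims to identify graph instances whose behavior deviates significantly from normal patterns given only a few labeled anomalies and abundant unlabeled data. To learn graph feature representations that are sensitive to anomalies and can separate out the normal class, the authors propose a multi-task learning method based on synthetic anomalies: it generates synthetic anomalies through several kinds of perturbation and assigns a dedicated detection head to each anomaly type, guiding the model toward more discriminative features. Experiments show that the method outperforms existing methods on multiple public datasets.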

Comments 14 pages, 7 figures, published in IEEE Transactions on Knowledge and Data Engineering, 2026

详情
Journal ref
IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 4, pp. 2326-2339, 2026
Abstract

Weakly supervised graph anomaly detection aims to unveil unusual graph instances, e.g., nodes, whose behaviors significantly differ from normal ones, given only a limited number of annotated anomalies and abundant unlabeled samples. A major challenge is to learn a meaningful latent feature representation that reduces intra-class variance among normal data while remaining highly sensitive to anomalies. Although recent works have applied self-supervised feature learning for graph anomaly detection, their strategies are not specifically tailored to its unique requirements, motivating our exploration of a more domain-specific approach. In this paper, we introduce a weakly supervised graph anomaly detection method that leverages a feature learning strategy tailored for graph anomalies. Our approach is built upon a multi-task learning scheme that extracts robust feature representations through synthesized anomalies. We generate synthetic anomalies by perturbing the normal graph in various ways and assign a dedicated detection head to each anomaly type, ensuring that learned features are sensitive to potential deviations from normal patterns. Although synthetic anomalies may not perfectly replicate real-world patterns, they provide valuable auxiliary data for effective feature learning, much like features learned from ImageNet classification transfer to downstream vision tasks. Additionally, we adopt a two-phase learning strategy: an initial warm-up phase using only synthetic samples, followed by a full-training phase integrating both tasks, to balance the influence of synthetic and real data. Extensive experiments on public datasets demonstrate the superior performance of our method over its competitors. Code is available at https://github.com/yj-zhou/SAWGAD.

2605.11748 2026-05-13 cs.CV

BronchoLumen: Analysis of recent YOLO-based architectures for real-time bronchial orifice detection in video bronchoscopy

Yongchao Li, Marian Himstedt

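AI summary This paper proposes BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy images, intended to assist bronchoscopic navigation and computer-aided diagnosis systems. The study compares the detection performance of YOLOv8 and YOLOv12, which integrates attention modules, across image domains; YOLOv12 is slightly better than YOLOv8 in localization accuracy but slightly lower in overall precision, and the system is robust in most scenarios. The method offers an efficient and accurate solution for cross-domain bronchial orifice detection and has been released publicly to support related research.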

Comments 10 pages, 4 figures, IPCAI 2026

Abstract

Bronchoscopy is routinely conducted in pulmonary clinics and intensive care units, but navigating the complex branching of the respiratory tract remains challenging. This paper introduces BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy, aiming to assist navigation and CAD systems. The paper investigates if bronchial orifices can be robustly detected across image domains using state-of-the-art object detection and a limited set of public image data. The study includes the description and comparison of YOLOv8, a widely adopted architecture, and YOLOv12, a more recent architecture integrating attention-based modules to improve spatial reasoning. Both models are trained and tested solely on publicly available datasets comprising different image domains. A comparison of both models is conducted based on the common metrics mAP@0.5 and mAP@0.5:0.9 with the latter emphasizing localization accuracy. For YOLOv8 we obtained a mAP@0.5 of 0.91 on an in-domain and 0.68 on a cross-domain test set. YOLOv12 achieved 0.84 and 0.68 respectively with slightly better localization accuracy with mAP@0.5:0.9 of 0.48 and 0.26 compared to YOLOv8 with 0.45 and 0.25. Challenges like motion blur and low contrast occasionally entailed uncertainties but the system demonstrated overall robustness in most scenarios. BronchoLumen is an open-weight, YOLO-based solution for bronchial orifice detection offering high accuracy and efficiency across multiple image domains. While the more recent YOLOv12 achieves better localization accuracy, we observed a slightly worse precision. The models have been made publicly available to foster further research in bronchoscopy navigation.
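
The thresholds in the mAP metrics quoted in this abstract refer to intersection over union (IoU) between a predicted box and a ground-truth box. As a reminder of what that threshold measures, here is a minimal sketch in the usual `(x1, y1, x2, y2)` corner format. The function names and box format are assumptions, and the sketch covers only the matching criterion, not the full mAP computation or the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A detection counts as a hit at a given threshold when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold
```

Under mAP@0.5, a loosely placed box can still count as a hit. Averaging over a range of stricter thresholds penalises such boxes, which is why the range-based metric emphasises localization accuracy.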

2605.11746 2026-05-13 cs.AI

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

Wenkai Li, Fan Yang, Ananya Hazarika, Shaunak A. Mehta, Koichi Onoue

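AI summary This study examines whether the visible reasoning trace in chain-of-thought (CoT) reasoning stays consistent with the underlying computation. Using a Detect-Classify-Compare framework combined with several validation methods, it finds that in most models the trace and the answer commitment are misaligned at the step level; in particular, the trace often keeps producing deliberative-looking text with no substantive effect after the answer has already been settled. The study also indicates that CoT remains valuable for improving model performance but is a markedly unreliable record of when the answer was formed.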

Abstract

Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.

2605.11744 2026-05-13 cs.CL cs.LG

Training-Inference Consistent Segmented Execution for Long-Context LLMs

Xianpeng Shang, Jiang Li, Zehua Duo, Qianyi Cai, Xiangdong Su

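AI summary This paper targets the compute and memory bottlenecks that Transformer-based large language models face in long-context generation and proposes a training-inference consistent segmented execution framework. During training, the method mimics the segmented execution semantics used at inference by restricting gradient propagation to the KV states of the preceding segment, which keeps training and inference consistent. Experiments show that on long-context tasks it performs close to full-context attention while offering competitive latency and memory trade-offs against existing efficient-inference methods, substantially improving scalability at very long context lengths.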

Comments Accepted by ICML 2026. 19 pages, 6 figures, 3 tables

Abstract

Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).

2605.11743 2026-05-13 cs.CV cs.LG

WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views

SeongMin Jin, Doo Seok Jeong

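AI summary This paper proposes WorldComp2D, a lightweight representation learning framework that learns spatio-semantic representations of object identity and location from local views. The method uses multiscale local receptive fields to explicitly structure the latent space according to object identity and spatial proximity, and comprises a proximity-dependent encoder and a localizer that infers object coordinates in the input. Experiments show that, compared with existing lightweight models, WorldComp2D reduces parameters and FLOPs by up to 4.0x and 2.2x respectively while maintaining real-time performance on CPU, supporting its efficiency and generality for spatio-semantic reasoning.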

Comments Accepted as a regular paper at ICML 2026

Abstract

Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.

2605.11742 2026-05-13 cs.LG

Online Continual Learning with Dynamic Label Hierarchies

Xinrui Wang, Shao-Yuan Li, Bartłomiej Twardowski, Alexandra Gomez-Villa, Songcan Chen

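AI summary This paper studies online continual learning with dynamic label hierarchies: how to cope, while learning from non-stationary data streams, with a label hierarchy that evolves between fine and coarse granularities. Most existing methods assume a flat label space and ignore the hierarchical organization of real-world concepts. The authors therefore introduce a new problem setting, DHOCL, and design HALO, which adaptively combines classification heads with structured prototypes to achieve rapid adaptation and knowledge retention, and which performs strongly on multiple benchmarks.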

Comments Accepted to ICML 2026

Abstract

Online Continual Learning (OCL) aims to learn from endless non-stationary data streams, yet most existing methods assume a flat label space and overlook the hierarchical organization of real-world concepts that evolves both horizontally (sibling classes) and vertically (coarse or fine categories). To better reflect this context, we introduce a new problem setting, DHOCL (Online Continual Learning from Dynamic Hierarchies), where taxonomies evolve across granularities and each sample provides supervision at a single hierarchical level. In this setting, we find two fundamental issues: (i) partial supervision under mixed granularities provides only point-wise signals over an evolving path-wise hierarchy, which constrains plasticity and undermines cross-level semantic consistency, and (ii) the dynamically evolving hierarchies induce granularity-dependent interference, destabilizing popular replay and regularization mechanisms and thereby exacerbating catastrophic forgetting. To tackle these issues, we propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves. Extensive experiments on multiple benchmarks demonstrate that HALO consistently outperforms existing methods across hierarchical accuracy, mistake severity, and continual performance.

2605.11738 2026-05-13 cs.AI

OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling

Zhong Li, Zihan Guo, Xiaohan Lu, Juntao Wang, Jie Song, Chao Shen, Jiageng Wu, Mingyang Sun

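AI Summary This paper proposes OptArgus, a multi-agent system for detecting hallucinations in LLM-based optimization modeling. The work focuses on structural inconsistencies that can arise when an LLM translates a natural-language optimization problem into a mathematical model and solver code, and builds a fine-grained hallucination taxonomy covering objective, variable, constraint, and implementation failures. Through multi-agent collaboration that combines conductor routing, specialist auditing, and evidence consolidation, OptArgus substantially improves detection accuracy and localization, and outperforms a single-agent approach on a benchmark suite containing several types of data.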

English abstract

Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as optimization-modeling hallucination detection, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine-grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi-agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three-part benchmark suite with 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM-generated artifacts. Against a matched single-agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top-ranked localization on controlled single-error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization-modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy-grounded auditing is a practical route to more reliable optimization modeling.

2605.11735 2026-05-13 cs.LG eess.SP

U-STS-LLM: A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation

Yichen Zhang, Jun Li

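AI Summary This paper proposes U-STS-LLM, a unified spatio-temporally steered large language model for traffic prediction and missing-value imputation. The model dynamically generates spatio-temporal attention biases that explicitly steer the LLM toward key spatio-temporal structure, and combines low-rank adaptation with a gated fusion mechanism for efficient, stable parameter optimization. Experiments show that U-STS-LLM achieves better forecasting and imputation performance than existing methods on real-world cellular network datasets, demonstrating the potential of applying large models to structured, non-linguistic domains.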

Comments 14 pages, 6 figures

English abstract

The efficient operation of modern cellular networks hinges on the accurate analysis of spatio-temporal traffic data. Mastering these patterns is essential for core network functions, chiefly forecasting future load to pre-empt congestion and imputing missing values caused by sensor failures or transmission errors to ensure data continuity. While deeply connected, forecasting and imputation have historically evolved as separate sub-fields. The dominant paradigm, Spatio-Temporal Graph Neural Networks (STGNNs), is effective, but such models are often specialized and computationally intensive and exhibit limited generalization. Concurrently, adapting large pre-trained language models (LLMs) offers a powerful alternative for sequence modeling, yet existing approaches provide weak structural guidance, leading to unstable convergence and a narrow focus on forecasting. To bridge these gaps, we propose U-STS-LLM, a unified framework built on a spatio-temporally steered LLM. Our core innovation is a Dynamic Spatio-Temporal Attention Bias Generator that synthesizes a persistent functional graph with transient nodal states to explicitly steer the LLM's attention. Coupled with a partially frozen backbone tuned via Low-Rank Adaptation (LoRA) and a Gated Adaptive Fusion mechanism, the model achieves stable, parameter-efficient adaptation. Trained under a unified multi-task objective, U-STS-LLM learns a holistic data representation. Extensive experiments on real-world cellular datasets demonstrate that U-STS-LLM establishes new state-of-the-art performance in both long-horizon forecasting and high-missing-rate imputation, while maintaining remarkable training efficiency and stability, offering a novel blueprint for harnessing foundation models in structured, non-linguistic domains.

2605.11730 2026-05-13 cs.LG cs.CR

Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith

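AI Summary This study proposes Persona-Conditioned Adversarial Prompting (PCAP), which improves red-teaming of large language models by introducing diverse attacker personas and strategy sets, so that potential threats are discovered and addressed more comprehensively. PCAP searches attack strategies for different personas in parallel, generates attack samples that better cover realistic scenarios, and markedly raises both the attack success rate and the diversity of defense data. Experiments show that fine-tuning on PCAP-generated data effectively strengthens model robustness while keeping the false-positive rate low, demonstrating a practical closed loop from vulnerability discovery to automated alignment.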

English abstract

Automated red-teaming for LLMs often discovers only narrow attack slices, missing diverse real-world threats and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2-6× more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 → 0.99, F1: 0.53 → 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.

2605.11727 2026-05-13 cs.AI cs.CL cs.CV

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

Kepeng Xu, Li Xu, Gang He, Wenxin Yu

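AI Summary This study explores how visual inputs closer to raw camera measurements can improve the perception of vision-language models. It proposes PRISM-VL, a measurement-grounded vision-language learning framework that combines RAW image inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation to strengthen the model's perception of real-world scene information. Experiments show that the method markedly improves performance in challenging scenes such as low light and high dynamic range, confirming the importance of preserving measurement-domain information for multimodal reasoning.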

English abstract

Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.

2605.11722 2026-05-13 cs.CV cs.LG

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

Sunung Mun, Sunghyun Cho, Jungseul Ok

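AI Summary EPIC is a training-free inference-time refinement framework for compositional text-to-image generation, targeting prompts that involve multiple objects, counts, attributes, and relations. It parses the prompt into a fixed visual program and uses predicate-guided search to verify and correct images, declaring success only when all conditions are satisfied. Experiments show that EPIC markedly improves generation accuracy on GenEval2 and substantially reduces compute cost compared with existing methods.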

English abstract

Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
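The routing rule in this abstract (all predicates satisfied means success, local failures go to editing, global failures go to resampling) can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: the `Predicate` class, the `"local"`/`"global"` scope labels, the `refine` function, and the toy set-based images are all assumptions introduced here, and how EPIC actually classifies a failure as local or global is not stated in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

Image = FrozenSet[str]  # toy stand-in: the set of facts visible in an image


@dataclass(frozen=True)
class Predicate:
    name: str
    scope: str  # "local" or "global"; hypothetical labels used for routing


def refine(
    program: List[Predicate],
    generate: Callable[[], Image],
    edit: Callable[[Image, List[Predicate]], Image],
    verify: Callable[[Image, Predicate], bool],
    budget: int,
) -> Tuple[Optional[Image], int]:
    """Search until every predicate holds or the execution budget runs out."""
    image = generate()
    used = 1  # number of image-model executions so far
    while True:
        # The program is fixed; only the image changes between iterations.
        failed = [p for p in program if not verify(image, p)]
        if not failed:
            return image, used  # all predicates satisfied
        if used >= budget:
            return None, used  # budget exhausted without success
        if any(p.scope == "global" for p in failed):
            image = generate()  # global failure: resample from scratch
        else:
            image = edit(image, failed)  # local failure: targeted edit
        used += 1


# Toy run: the generator gets the object right but misses one attribute.
program = [Predicate("has_cat", "global"), Predicate("cat_is_black", "local")]
result = refine(
    program,
    generate=lambda: frozenset({"has_cat"}),
    edit=lambda img, failed: img | {p.name for p in failed},
    verify=lambda img, p: p.name in img,
    budget=4,
)
print(result)
```

In the toy run the single failed predicate is local, so one edit is enough and the loop stops after two image-model executions. A real system would replace the three lambdas with a generator, an editor, and a verifier that extracts visual evidence from the image.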

2605.11716 2026-05-13 cs.AI

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, Yu Tian

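AI Summary Multimodal large language models (MLLMs) face considerable safety challenges under jailbreak attacks, and existing defenses rely on costly fine-tuning or inefficient post-hoc processing, which struggle with novel attacks and involve performance trade-offs. This paper proposes SafeSteer, a decoding-stage defense mechanism that introduces a lightweight Decoding-Probe to detect and correct harmful outputs, and uses a modal semantic alignment vector to transfer textual safety alignment to the vision modality. Experiments show that SafeSteer improves MLLM safety by up to 33.40% without fine-tuning while preserving the models' effectiveness and usefulness.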

English abstract

Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address these issues, we explore the inherent safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at the decoding stage. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are stealthier. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSteer can improve MLLMs' safety by up to 33.40% without fine-tuning. Notably, it can maintain the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.

2605.11714 2026-05-13 cs.RO

Introducing Environmental Constraints to Grasping Strategies for Paper-Like Flexible Materials Using a Soft Gripper

Yi Dong, Yang Li, Jinjun Duan, Zhendong Dai

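AI Summary This paper studies how to introduce environmental constraints to improve grasping of paper-like flexible materials with a soft gripper. By analyzing how material properties and working conditions affect grasping, it proposes a systematic set of grasping strategies based on environmental constraints and establishes their mechanical and kinematic models. Experiments verify the applicable scenarios and performance of the different strategies, offering a feasible solution for household service robots that grasp planar flexible objects.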

Comments Under Review

English abstract

Robotic manipulation of flexible objects is widely required in both industrial and service applications. Among such objects, paper-like materials exhibit distinct mechanical characteristics compared to cloth, being more sensitive to compressive stress, where minor variations in physical properties can significantly affect grasping. This study systematically investigates grasping strategies for paper-like materials using a universal soft gripper by exploiting environmental constraints. Based on manipulation primitives employed in existing grasping strategies, we proposed systematic grasping strategies for flexible materials by exploiting environmental constraints and analyzed their mechanical and kinematic models. To investigate the influence of materials and working conditions on grasping, an evaluation system for measuring grasping force and success rate was defined and experimentally evaluated. Finally, we summarized the specific workspaces and characteristics of different strategies that can satisfy various task requirements and lead to potential applications in household service robots for grasping planar flexible objects.

2605.11712 2026-05-13 cs.AI

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

Wenhao Chen, Sirui Sun, Shengyuan Bai, Guojie Song

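AI Summary This paper addresses unstable value expression in large language model (LLM) value alignment, which is caused by the dynamic nature of the residual stream, and proposes a new architecture, the Stable Value Guidance Transformer (SVGT). The method introduces an independent value module that separates value representations from the backbone and uses learnable Bridge Tokens for stable value guidance, markedly improving safety while maintaining generation fluency. Experiments show that SVGT effectively reduces harmful outputs on multiple benchmarks, validating the effectiveness of structured value modeling.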

Comments Accepted to ICML 2026 (Spotlight). 32 pages

English abstract

Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.

2605.11711 2026-05-13 cs.LG cs.AI

Debiased Model-based Representations for Sample-efficient Continuous Control

Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye

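AI Summary This paper proposes DR.Q, a debiased model-based representation learning method for improving sample efficiency in continuous control. By maximizing the mutual information between the current state-action pair and its next state, combined with a faded prioritized experience replay strategy, it mitigates the bias and overfitting of conventional representation learning methods. Experiments show that DR.Q performs strongly on multiple benchmark tasks, matching or surpassing existing advanced methods.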

Comments ICML 2026

English abstract

Model-based representations have recently stood out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. This framework implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, termed the DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state in addition to minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.

2605.11706 2026-05-13 cs.LG

GRAFT: Graph-Tokenized LLMs for Tool Planning

Xinyi Gao, Xinyu Ren, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin

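AI Summary GRAFT is a graph-tokenized large language model framework for tool planning, designed to align tool selection with subtask intent and to satisfy dependencies among tools in complex tasks. It internalizes the tool graph into the model by mapping each tool node to a dedicated special token and learning directed tool dependencies in the representation space. GRAFT also introduces on-policy tool context distillation, improving the model's ability to generate legal, accurate tool sequences in complex workflows; experiments show state-of-the-art performance in sequence matching and dependency legality.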

English abstract

Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.
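Two ideas from this abstract, one dedicated special token per tool node and checking a plan against directed dependencies, can be illustrated with a few lines of Python. This is a hedged sketch under stated assumptions: the `<tool_i>` token format, the function names, the three example tools, and the reading of "dependency legality" as "every consecutive pair of tools follows a directed edge" are illustrative choices made here, not definitions taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def tool_tokens(tools: List[str]) -> Dict[str, str]:
    """Assign each tool node one dedicated special token (format is assumed)."""
    return {tool: f"<tool_{i}>" for i, tool in enumerate(tools)}


def is_legal(plan: List[str], edges: Set[Tuple[str, str]]) -> bool:
    """True if every consecutive pair in the plan follows a directed edge."""
    return all((a, b) in edges for a, b in zip(plan, plan[1:]))


# Hypothetical three-tool graph: ocr -> translate -> summarize.
tools = ["ocr", "translate", "summarize"]
edges = {("ocr", "translate"), ("translate", "summarize")}

tokens = tool_tokens(tools)
good_plan = ["ocr", "translate", "summarize"]
bad_plan = ["translate", "ocr"]  # plausible tools, but reverses an edge

print([tokens[t] for t in good_plan])
print(is_legal(good_plan, edges), is_legal(bad_plan, edges))
```

The second plan shows the failure mode the abstract describes: each tool is a sensible choice on its own, yet the sequence violates the graph's direction.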

2605.11705 2026-05-13 cs.CV

CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

Boran Zhao, Hetian Liu, Zhenxian Hu, Yuqing Yuan, Yu Yan, Pengju Ren

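AI Summary This paper proposes CAST, a multimodal coreset selection framework that addresses the high computational cost of training multimodal models on large-scale image-text datasets. CAST constructs topologies for the image and text modalities and combines them with a local-collapse-aware fusion strategy to obtain a balanced representation of cross-modal information. It also introduces multi-scale distribution matching in the diffusion wavelet domain and a local soft relational coverage mechanism, improving the coreset's semantic structure, fine-grained detail, and redundancy suppression. Experiments show that CAST outperforms existing methods on multiple datasets, with stronger cross-architecture generalization and computational efficiency.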

English abstract

The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.

2605.11704 2026-05-13 cs.CV

ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

Inwoo Hwang, Hojun Jang, Bing Zhou, Jian Wang, Young Min Kim, Chuan Guo

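AI Summary This paper proposes ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. It treats motion generation as a coarse-to-fine process and autoregressively predicts multi-scale skeletal-temporal discrete tokens to generate high-quality motion sequences. Bitwise quantization and prediction scale up the token vocabulary and stabilize generation. Experiments show that it outperforms existing methods on multiple metrics and supports training-free, text-guided motion editing.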

Comments Project page: https://inwoohwang.me/ScaleMoGen

English abstract

We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.

2605.11697 2026-05-13 cs.RO

Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

Hassen Nigatu, Gaokun Shi, Jituo Li, Wang Jin, Lu Guodong

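AI Summary This paper proposes a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (Rainbow DQN) for cooperative peg-in-hole manipulation by a Delta parallel robot and a 3-RRS parallel manipulator. By optimizing the geometry of the 3-RRS manipulator, it enlarges the singularity-free workspace and thereby makes exploration by the reinforcement learning policy safer. The framework models cooperative insertion as a Markov decision process and combines a custom reward function with a two-stage training curriculum, achieving stable policy convergence and reliable insertion in a high-fidelity simulation environment.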

Comments 10 pages

English abstract

This paper presents a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (DQN) for cooperative peg-in-hole manipulation by a Delta parallel robot and a 3-RRS (Revolute-Revolute-Spherical) parallel manipulator. A key contribution is the integration of a geometric design-optimization stage that precedes learning: the 3-RRS geometry is tuned to maximize the singularity-free workspace and improve conditioning, which in turn enlarges the safe region in which the reinforcement learning policy can explore. Together the two manipulators expose a 6 degree-of-freedom (DoF) controllable subspace (three Delta translations, two 3-RRS rotations, and one 3-RRS vertical translation); the peg-in-hole task is invariant to rotation about the peg axis, so the task-relevant manifold is five dimensional. The cooperative insertion problem is cast as a Markov Decision Process with a 12-dimensional state vector and a discrete action set containing 6 × 2 = 12 incremental commands (one positive and one negative per controlled DoF). A shaped reward combines dense proximity guidance, penalties for kinematic and workspace violations, and sparse bonuses for successful insertions. The Rainbow DQN, which integrates double Q-learning, dueling architecture, prioritized replay, multi-step returns, noisy linear layers for exploration, and a distributional value head, is trained with a two-stage curriculum. The co-designed framework is validated in a high-fidelity kinematic simulator, where it achieves stable policy convergence, reliable insertions, and reduced constraint violations compared against a vanilla DQN agent and a classical sampling-based planner.
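The discrete action set described in this abstract (one positive and one negative incremental command for each of the six controlled DoFs) can be enumerated in a few lines. The sketch below is only an illustration: the DoF labels, the unit step size, the dictionary-based pose, and the function names are assumptions introduced here, and the paper's actual increments and state encoding are not given in the abstract.

```python
from typing import Dict, List, Tuple

# Hypothetical DoF labels: three Delta translations, two 3-RRS rotations,
# and one 3-RRS vertical translation, as listed in the abstract.
DOF_NAMES = ["delta_x", "delta_y", "delta_z", "rrs_roll", "rrs_pitch", "rrs_z"]


def build_action_set(step: float = 1.0) -> List[Tuple[str, float]]:
    """Return (dof_name, increment) pairs: +step and -step for each DoF."""
    actions = []
    for name in DOF_NAMES:
        actions.append((name, +step))
        actions.append((name, -step))
    return actions


def apply_action(pose: Dict[str, float], action_index: int,
                 step: float = 1.0) -> Dict[str, float]:
    """Apply one incremental command to a copy of the pose dictionary."""
    name, increment = build_action_set(step)[action_index]
    new_pose = dict(pose)
    new_pose[name] = new_pose[name] + increment
    return new_pose


ACTIONS = build_action_set()
start = {name: 0.0 for name in DOF_NAMES}
moved = apply_action(start, 1)  # index 1 is the negative delta_x command

print(len(ACTIONS))  # 12, i.e. 6 DoFs x 2 directions
print(moved["delta_x"])  # -1.0
```

With this layout, even indices are positive increments and odd indices are negative ones, so a Q-network with 12 outputs maps directly onto the command list.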