arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13228 2026-05-14 cs.CV cs.AI

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

Xiao Liu, Nayu Liu, Junnan Zhu, Ruirui Chen, Guohui Xiang, Changjian Wang, Kaiwen Wei, Rongzhen Li, Jiang Zhong

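Affiliations: Chongqing University; Tianjin University; MAIS, Institute of Automation, Chinese Academy of Sciences; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Chongqing National Data AI Research Institute, AI Research Lab

AI summary: This paper proposes ReTool-Video, a recursive tool-using video agent method that aims to improve complex reasoning and cross-modal analysis in video understanding. To address the limits of existing video agents in tool granularity and action space, the authors build the MetaAug-Video Tool Library (MVTL), which contains 134 tools and supports fine-grained operations and multi-level information access, and design a recursive tool-calling mechanism that progressively decomposes high-level video intents into executable tool chains. Experiments show that the method performs well on several benchmarks and significantly improves the stability and effectiveness of complex video understanding.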

Abstract

Video understanding requires active evidence seeking, motivating tool-augmented video agents for temporal reasoning, cross-modal understanding, and complex question answering. Existing video agents have improved video reasoning with retrieval, memory, frame inspection, and verifier tools, but they still face two limitations: (1) a coarse tool space that lacks fine-grained operations for compositional reasoning; and (2) a flat action space that forces high-level video intents into primitive executable tool calls. In this paper, we address these challenges with two complementary designs. First, we construct a MetaAug-Video Tool Library (MVTL), an extensible tool library with 134 registered tools, including 26 base tools for general multimodal signal processing and 108 meta tools for filtering, aggregation, reranking, formatting, and other intermediate-result operations. MVTL supports dual-level access to both structured video information and raw modal evidence, enabling diverse video reasoning scenarios. Second, we propose ReTool-Video, a recursive tool-using method that grounds high-level video intents into executable tool chains. In ReTool-Video, matched actions are executed directly, while unmatched intents are delegated to a resolver for parameter repair, tool substitution, or decomposition. This allows abstract actions such as temporal merging, cross-modal verification, or repeated-event aggregation to be progressively translated into concrete multimodal operations at runtime. Experiments on MVBench, MLVU, and Video-MME w/o sub. show that ReTool-Video consistently outperforms strong baselines. Further analysis demonstrates that recursive grounding and fine-grained meta tools improve the stability and effectiveness of complex video understanding.
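
The recursive grounding step described in the abstract (run matched actions directly, hand unmatched intents to a resolver that decomposes them) can be pictured with a small sketch. Everything named here is hypothetical: the three toy tools, the `DECOMPOSITIONS` table standing in for the resolver, and the `ground` function are illustrative assumptions, not the MVTL tools or the authors' implementation, and parameter repair and tool substitution are left out.

```python
from typing import Callable, Dict, List

# Hypothetical tool registry: names and behaviors are illustrative only.
TOOLS: Dict[str, Callable[[List[str]], List[str]]] = {
    "detect_events": lambda xs: [x for x in xs if x.startswith("event")],
    "dedupe": lambda xs: sorted(set(xs)),
    "count": lambda xs: [str(len(xs))],
}

# Hypothetical resolver table: maps an abstract intent to a chain of finer intents.
DECOMPOSITIONS: Dict[str, List[str]] = {
    "repeated_event_aggregation": ["detect_events", "unique_count"],
    "unique_count": ["dedupe", "count"],
}


def ground(intent: str, data: List[str], depth: int = 0, max_depth: int = 4) -> List[str]:
    """Execute a matched tool directly; otherwise decompose the intent and recurse."""
    if intent in TOOLS:
        return TOOLS[intent](data)
    if depth >= max_depth or intent not in DECOMPOSITIONS:
        raise LookupError(f"cannot ground intent: {intent}")
    for sub_intent in DECOMPOSITIONS[intent]:
        data = ground(sub_intent, data, depth + 1, max_depth)
    return data


frames = ["event:jump", "noise", "event:jump", "event:run"]
result = ground("repeated_event_aggregation", frames)
print(result)  # prints ['2']
```

The depth limit is a design choice of this sketch: it keeps a badly specified decomposition table from recursing forever.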

2605.13225 2026-05-14 cs.LG

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

Paul Jeha, Anastasiia Sedova, Louis Béthune, Skyler Seto, Jes Frellsen, Pierre Ablin, Natalie Schluter

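Affiliations: Apple; DTU (Technical University of Denmark)

AI summary: For language model pre-training under data constraints, the study compares hyperparameter tuning with bilingual data mixing and finds that mixing beats hyperparameter tuning on both validation loss and downstream task accuracy, with the advantage growing as model size increases. The study further quantifies the gain from mixing, showing that it is equivalent to having 2 to 13 times more target-language data, and reveals that validation loss does not fully capture the benefit of mixing. Based on these results, the authors recommend prioritizing data mixing with a high-resource language in data-constrained settings and transferring hyperparameter settings via μP.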

Abstract

For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three findings emerge. First, mixing yields larger improvements than hyperparameter tuning on both validation loss and downstream task accuracy, and the gap grows with model size. Second, we quantify how much mixing helps: it boosts performance by an amount equivalent to 2--3$\times$ the unique target data on validation loss and 2--13$\times$ on downstream task accuracy, with the gain scaling steeply with model size. Third, this divergence reveals that target-language validation loss systematically underestimates mixing's value. Mixing regularizes by diversifying the training signal and contributes knowledge the repeated target corpus cannot supply; validation loss captures only the first effect. Our practical recommendations are: mix in a high-resource language, prioritize the mixing ratio over hyperparameter tuning, and transfer hyperparameters from a small proxy model via $μ$P.
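
To make the "tunable mixing ratio" concrete, here is a minimal sketch of a sampler that draws each training document from the auxiliary corpus with a fixed probability and otherwise from the repeated target corpus. The function name `mixed_stream`, the parameter `aux_ratio`, and the toy document ids are assumptions made for illustration; the paper's actual data pipeline and the way it defines the ratio (per document, per token, or per batch) are not given in the abstract.

```python
import random
from itertools import cycle
from typing import Iterator, List


def mixed_stream(target: List[str], auxiliary: List[str], aux_ratio: float,
                 seed: int = 0) -> Iterator[str]:
    """Yield samples, drawing from the auxiliary corpus with probability aux_ratio."""
    if not 0.0 <= aux_ratio <= 1.0:
        raise ValueError("aux_ratio must be in [0, 1]")
    rng = random.Random(seed)
    target_iter = cycle(target)    # small target corpus: repeated for many epochs
    aux_iter = cycle(auxiliary)    # large auxiliary corpus: rarely repeated
    while True:
        yield next(aux_iter) if rng.random() < aux_ratio else next(target_iter)


target_docs = [f"ar_{i}" for i in range(3)]      # stand-in for scarce Arabic data
aux_docs = [f"en_{i}" for i in range(1000)]      # stand-in for abundant English data
stream = mixed_stream(target_docs, aux_docs, aux_ratio=0.5)
batch = [next(stream) for _ in range(1000)]
aux_share = sum(doc.startswith("en_") for doc in batch) / len(batch)
print(round(aux_share, 2))
```

Because `mixed_stream` is a generator, the range check on `aux_ratio` runs when the first sample is requested rather than when the stream is created.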

2605.13223 2026-05-14 cs.CV

Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation

Abdelrahman Eldesokey, Merey Ramazanova, Ahmad Sait, Ansar Khangeldin, Karen Sanchez, Tong Zhang, Bernard Ghanem

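Affiliations: King Abdullah University of Science and Technology

AI summary: With the rapid progress of text-to-image generation, reliable model evaluation has become especially important. This paper proposes a skill-aligned annotation approach that matches annotation strategies to the intrinsic characteristics of each evaluation skill, improving the consistency and stability of evaluation. The study also builds an automated evaluation pipeline that enables scalable, fine-grained evaluation, and emphasizes that improving the foundations of evaluation can raise efficiency without simply increasing annotation effort.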

Comments Project Page: https://abdo-eldesokey.github.io/skill-aligned-eval/

Abstract

Text-to-image (T2I) generation has advanced rapidly, making reliable evaluation critical as performance differences between models narrow. Existing evaluation practices typically apply uniform annotation mechanisms, such as Likert-scale or binary question answering (BQA), across heterogeneous evaluation skills, despite fundamental differences in their nature. In this work, we revisit T2I evaluation through the lens of skill-aligned annotation, where annotation strategies reflect the underlying characteristics of each evaluation skill. We systematically compare skill-aligned annotation against uniform baselines and show that it produces more consistent evaluation signals, with higher inter-annotator agreement and improved stability across models. Finally, we present an automated pipeline that instantiates the proposed evaluation protocol, enabling scalable and fine-grained evaluation with spatially grounded feedback. Our work highlights that improving the foundations of image evaluation can increase reliability and efficiency without simply scaling annotation effort. We hope this motivates further research on refining evaluation protocols as a central component of reliable model assessment.

2605.13221 2026-05-14 cs.AI cs.LG

An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing

Hanwen Zhang, Dusit Niyato, Wei Zhang, Xin Lou, Malcolm Yoke Hean Low

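Affiliations: Nanyang Technological University; Singapore Institute of Technology; Seatrium New Energy Laboratory; Ministry of Education (MOE) Tier 1; Research Innovation and Enterprise (RIE) 2025 Industry Alignment Fund-Industry Collaboration Projects (IAF-ICP)

AI summary: This paper studies a hybrid scheduling problem in UAV-assisted logistics combined with edge computing, in which physical logistics decisions are coupled with computational task scheduling. To tackle it, the authors propose an agentic-AI-based optimization framework that combines large language models with chain-of-thought reasoning to translate user input into an interpretable mathematical model, and design a hierarchical deep reinforcement learning method based on proximal policy optimization to optimize UAV routing and resource allocation for task execution. Experiments show that the framework performs well on deadline satisfaction rate and product collection success rate, with stable performance that exceeds that of traditional methods.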

Comments 15 pages

Abstract

In cloud manufacturing, unmanned aerial vehicles (UAVs) can support both product collection and mobile edge computing (MEC). This joint operation forms a hybrid scheduling problem, where physical logistics decisions are coupled with computational task scheduling. In this paper, UAVs collect finished products from manufacturing stations and transport them back to a central depot. Meanwhile, computational tasks generated by industrial sensor devices at these stations are processed locally, at UAVs, or offloaded via UAVs to the cloud. This coupling makes the problem challenging. A UAV can provide MEC services only during its service window at a station, so routing decisions directly determine when UAV-assisted offloading is available. Routing decisions also affect the UAV energy budget and the availability of onboard computing and communication resources for computational task execution under task deadline constraints. To address this, we propose an agentic-AI-assisted optimization framework with two components. First, we develop an agentic AI that combines large language models, retrieval-augmented generation, and chain-of-thought reasoning to translate user input into an interpretable mathematical formulation for the hybrid scheduling problem. Second, we design a hierarchical deep reinforcement learning approach based on proximal policy optimization (PPO), where the upper layer learns UAV routing and the lower layer optimizes per-slot task execution and resource allocation. Simulation results show that the proposed framework yields more consistent formulations, while the hierarchical PPO achieves full product collection in 99.6% of the last 500 episodes and maintains a 100% deadline satisfaction rate, with more stable performance than the advantage actor-critic approach.

2605.13218 2026-05-14 cs.LG

Machine Learning-Driven Multimodal Spectroscopic Liquid Biopsy for Early Multicancer Detection

Alejandro Leonardo García Navarro, Javier Cachón Ortiz, Javier González Colsa, Samuel García Díaz, Carlos Viadero Valderrama

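Affiliations: Signal Processing Group; Gregorio Marañón Health Research Institute; Amber Health Solutions

AI summary: This study proposes a multimodal liquid biopsy method based on several spectroscopic techniques and machine learning for early multicancer detection. By combining Fourier Transform Infrared (FTIR) spectroscopy, Raman spectroscopy, and Excitation-Emission Matrix (EEM) fluorescence spectroscopy, and using machine learning for data fusion and classification, it achieves high-accuracy detection of breast and colorectal cancer. The results show that multimodal fusion gives the most balanced performance in sensitivity and specificity, with ROC-AUC values of 0.997 and 0.994, respectively.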

Abstract

Cancer is one of the leading causes of death worldwide, making the development of rapid, minimally invasive, label-free and scalable diagnostic strategies a major challenge in modern oncology. In this context, spectroscopic liquid biopsy has emerged as a promising alternative, as it enables the holistic characterization of biochemical alterations in biological fluids. In this work, we propose a multimodal spectroscopic liquid biopsy framework for multicancer detection based on the combination of Fourier Transform Infrared (FTIR) spectroscopy, Raman spectroscopy, and Excitation-Emission Matrix (EEM) fluorescence spectroscopy together with Machine Learning (ML) methodologies. Serum samples from breast cancer patients, colorectal cancer patients, and healthy controls were analyzed through the three spectroscopic modalities. After modality-specific preprocessing, low-level data fusion (LLDF) was employed to integrate the complementary biochemical information encoded within the different spectroscopic measurements, and classification was performed using XGBoost models. Seven experimental configurations were evaluated, including the three unimodal approaches, all pairwise bimodal configurations, and the full multimodal approach of FTIR, Raman, and EEM fluorescence. The results show that although several individual modalities achieved high discrimination performance, the multimodal fusion provided the most balanced overall results, reaching a ROC-AUC of 0.997 for breast cancer and 0.994 for colorectal cancer, together with highly balanced sensitivity and specificity values.

2605.13208 2026-05-14 cs.RO

Calibration-Free Gas Source Localization with Mobile Robots: Source Term Estimation Based on Concentration Measurement Ranking

Wanting Jin, Agatha Duranceau, İzzet Kağan Erünsal, Alcherio Martinoli

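Affiliations: Distributed Intelligent Systems and Algorithms Laboratory, School of Architecture, Civil and Environmental Engineering, École Polytechnique Fédérale de Lausanne (EPFL)

AI summary: This paper studies calibration-free gas source localization with mobile robots and proposes a source term estimation method based on the ranking of concentration measurements. By comparing the rank differences between dynamically collected data and a physical dispersion model, the method estimates the probability distribution of the gas source over the environment, enabling efficient localization. It avoids the need to calibrate low-cost sensors, shows good localization accuracy in both simulation and physical experiments, and suits real-world applications such as emergency monitoring.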

Comments This paper has been accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA), 2026

Abstract

Efficient Gas Source Localization (GSL) in real-world settings is crucial, especially in emergency scenarios. Mobile robots equipped with low-cost, in-situ gas sensors offer a safer alternative to human inspection in hazardous environments. Probabilistic algorithms enhance GSL efficiency with scattered gas measurements by comparing gas concentration measurements gathered by robots to physical dispersion models. However, accurately deriving gas concentrations from data acquired with low-cost sensors is challenging due to the nonlinear sensor response, environmental dependencies (e.g., humidity, temperature, and other gas influences), and robot motion. Mitigating these disturbance factors requires frequent sensor calibration in controlled environments, which is often impractical for real-world deployments. To overcome these issues, we propose a novel feature extraction algorithm that leverages the relative ranking of gas measurements within the dynamically accumulated dataset. By comparing the rank differences between gathered and modeled values, we estimate the probabilistic distribution of source locations across the entire environment. We validate our approach in high-fidelity simulations and physical experiments, demonstrating consistent localization accuracy with uncalibrated gas sensors. Compared to existing methods, our technique eliminates the need for gas sensor calibration, making it well-suited for real-world applications.
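
The idea of comparing rank differences between gathered and modeled values can be sketched as follows. This is an assumption-laden toy, not the authors' algorithm: `toy_plume` is a placeholder dispersion model, the mismatch score is a simple sum of absolute rank differences, and the exponential weighting with a `temperature` parameter is one arbitrary way to turn scores into a distribution over candidate source locations.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ranks(values: Sequence[float]) -> List[int]:
    """Return the rank of each value (0 = smallest); ties follow index order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def toy_plume(source: Point, at: Point) -> float:
    """Placeholder dispersion model (assumption): decay with squared distance."""
    d2 = (source[0] - at[0]) ** 2 + (source[1] - at[1]) ** 2
    return 1.0 / (1.0 + d2)


def source_posterior(positions: Sequence[Point], readings: Sequence[float],
                     candidates: Sequence[Point],
                     temperature: float = 1.0) -> List[float]:
    """Score each candidate source by rank mismatch, then normalize to sum to 1."""
    measured = ranks(readings)
    weights = []
    for cand in candidates:
        modeled = ranks([toy_plume(cand, p) for p in positions])
        mismatch = sum(abs(a - b) for a, b in zip(measured, modeled))
        weights.append(math.exp(-mismatch / temperature))
    total = sum(weights)
    return [w / total for w in weights]


true_source = (2.0, 1.0)
positions = [(0.0, 0.0), (1.5, 0.5), (3.0, 0.5), (0.5, 2.5), (2.5, 3.0)]
# Uncalibrated sensor: an unknown offset, gain, and nonlinearity (all made up here).
readings = [7.0 + 40.0 * math.sqrt(toy_plume(true_source, p)) for p in positions]
candidates = [(float(x), float(y)) for x in range(4) for y in range(4)]
posterior = source_posterior(positions, readings, candidates)
best = candidates[posterior.index(max(posterior))]
print(best, round(max(posterior), 3))
```

Ranks are unchanged by any strictly increasing sensor response, so the true source gets zero mismatch here even though the raw readings are never converted to concentrations. With only five measurements, several candidates can tie for the top score; more measurements narrow the distribution.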

2605.13207 2026-05-14 cs.LG

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

Stefan Stojanovic, Alexandre Proutiere

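Affiliations: KTH (Royal Institute of Technology, Sweden); KTH, Digital Futures

AI summary: This paper studies how to achieve hierarchical control in zero-shot reinforcement learning and proposes "switching successor measures", which enable hierarchical decision-making without additional supervision, fixed horizons, or manually designed subgoals. The approach extends classical successor measures while preserving their structure, and on this basis the authors design the FB $π$-Switch algorithm, which extracts a high-level subgoal policy and a low-level control policy directly from forward-backward representations. Experiments show that the method outperforms non-hierarchical baselines on both goal-conditioned and general-reward tasks and matches existing hierarchical methods on goal-conditioned tasks.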

Abstract

Hierarchical reinforcement learning can improve generalization by decomposing long-horizon decision-making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal-conditioned objectives, which largely confine them to goal-reaching tasks and limit their applicability to general reward functions. In this paper, we introduce switching successor measures, an extension of successor measures that enables hierarchical control in zero-shot reinforcement learning without additional supervision, fixed horizons, or manually designed subgoals. We show that switching successor measures arise naturally from classical successor measures while preserving their underlying structure. Building on this result, we propose FB $π$-Switch, an algorithm that extracts both a high-level subgoal-selection policy and a low-level control policy directly from forward-backward (FB) representations, allowing hierarchical behavior to emerge from a single learned representation. Experiments on both goal-conditioned and general reward-based tasks show that FB $π$-Switch improves over non-hierarchical baselines and matches state-of-the-art hierarchical methods in goal-conditioned settings. These results demonstrate that structured successor representations provide a flexible foundation for hierarchical zero-shot reinforcement learning beyond goal-reaching tasks. Our project website is available at: https://stestokth.github.io/switching-successors/.

2605.13202 2026-05-14 cs.CV cs.AI

STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

Hongli Liu, Yu Wang, Shengjie Zhao

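Affiliations: School of Computer Science and Technology, Tongji University; Engineering Research Center of Key Software Technologies for Smart City Perception and Planning, Ministry of Education

AI summary: This paper studies semantic-temporal alignment in few-shot action recognition (FSAR) and proposes STAR, a unified semantic-temporal adaptive representation learning framework. By introducing a temporal semantic attention mechanism and a semantic temporal prototype refinement module, the method addresses the alignment between textual prompts and sparse visual cues in action sequences and strengthens the modeling of multi-scale temporal dynamics. Experiments show that STAR outperforms existing methods on several benchmark datasets, confirming its effectiveness with limited samples.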

Comments Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Abstract

Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic-Temporal Adaptive Representation Learning (STAR), a unified framework consisting of a semantic-alignment component and a temporal-aware component, which bridges the semantic and temporal gaps and transfers the sequence modeling capability of Mamba to FSAR. The semantic-alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.

2605.13200 2026-05-14 cs.LG cs.ET

A Hybrid Tucker-LSTM Tensor Network Model for SOC Prediction in Electric Vehicles

Han Wang, Ying Wang, Bing Wang

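Affiliations: College of Computer and Information Science; School of Culture Tourism; Digital Intelligence Center; China Automotive Engineering Research Institute Co., Ltd.

AI summary: This paper proposes a hybrid model that combines Tucker tensor decomposition with long short-term memory (LSTM) networks to predict the state of charge (SOC) of electric vehicle batteries. Using full-lifecycle field data from electric vehicles, the method applies Tucker decomposition to reduce data dimensionality while preserving temporal structure, which improves LSTM prediction performance. The results show that the hybrid model beats a conventional LSTM on several evaluation metrics and significantly improves SOC prediction accuracy, offering a new research direction for tensor-based battery management systems.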

Abstract

Accurate state of charge (SOC) estimation is critical for the success of electric vehicle battery management strategies, but conventional estimators suffer from two fundamental shortcomings: cumulative errors that grow over time and reliance on simplified battery models that do not reflect real-world dynamics. This paper therefore presents a hybrid approach combining Tucker tensor decomposition with LSTM networks, using full-lifecycle EV field data for SOC prediction. The inputs are charge status, mileage, voltage, current, cell differentials, and temporal features. Tucker decomposition is used to reduce dimensionality while maintaining the temporal structure, allowing a direct, fair comparison with a standard LSTM. Tucker-LSTM outperforms the baseline on all metrics, with MSE dropping 70.5% (from 21.07 to 6.22), MAE improving 48.7% (from 3.37% to 1.73%), RMSE falling from 4.59% to 2.49%, and $R^2$ rising from 0.918 to 0.976. Because these results demonstrate that tensor decomposition compresses high-dimensional battery data without loss of predictive fidelity, this paper opens up a new direction for tensor-based analytics in electric vehicle battery management.
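
The reported improvements can be checked against one another with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract; the variable names are made up, and the assumption that the RMSE figures are the square roots of the MSE figures is a consistency check, not something the abstract states.

```python
import math

# Values quoted in the abstract (baseline LSTM vs. Tucker-LSTM).
mse_base, mse_tucker = 21.07, 6.22
mae_base, mae_tucker = 3.37, 1.73

# Relative reductions, in percent.
mse_drop = (mse_base - mse_tucker) / mse_base * 100
mae_drop = (mae_base - mae_tucker) / mae_base * 100

# RMSE implied by the quoted MSE values.
rmse_base, rmse_tucker = math.sqrt(mse_base), math.sqrt(mse_tucker)

print(f"MSE reduction: {mse_drop:.1f}%")                 # prints MSE reduction: 70.5%
print(f"MAE reduction: {mae_drop:.1f}%")                 # prints MAE reduction: 48.7%
print(f"RMSE: {rmse_base:.2f} -> {rmse_tucker:.2f}")     # prints RMSE: 4.59 -> 2.49
```

The implied RMSE values match the reported 4.59% and 2.49%, so the quoted metrics are mutually consistent.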

2605.13197 2026-05-14 cs.LG cs.AI

McCast: Memory-Guided Latent Drift Correction for Long-Horizon Precipitation Nowcasting

Penghui Wen, Yu Luo, Lintao Wang, Mengwei He, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

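Affiliations: School of Computer Science, The University of Sydney, Australia; School of Life and Environmental Science, The University of Sydney, Australia

AI summary: Existing precipitation nowcasting methods usually adopt an autoregressive framework, which accumulates errors over long rollouts and causes forecasts to drift from physically plausible trajectories. To address this, the paper proposes McCast, a memory-guided latent drift correction method that introduces a temporally organized memory bank to actively correct deviations in latent evolution during autoregression, producing more temporally coherent and reliable long-horizon forecasts. Experiments show that McCast achieves state-of-the-art performance on the SEVIR and MeteoNet benchmarks, especially on long-horizon forecasting.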

Abstract

Existing precipitation nowcasting methods typically adopt an autoregressive formulation, where future states are predicted from previous outputs. However, such an approach accumulates errors over long rollouts, causing forecasts to drift away from physically plausible evolution trajectories. Although various studies have attempted to alleviate this problem by improving step-wise prediction accuracy, they largely neglect the global temporal evolution of meteorological systems and lack mechanisms to actively correct drift during rollouts. To address this issue, we propose McCast, a memory-guided latent drift correction method for precipitation nowcasting. Rather than treating memory as an unordered dictionary of latent states for passive conditioning, McCast leverages temporally organized memory to actively correct autoregressive latent evolution. Specifically, McCast introduces a Drift-Corrective Memory Bank (DCBank) that explicitly estimates the temporally consistent drift corrections to calibrate the divergent trajectory. DCBank performs drift correction in two stages: a Corrective Latent Extractor first predicts an initial correction from the current prediction and a reference latent state, and a Correction-Aware Memory Retrieval module then refines the initial correction using temporally organized historical memory. By explicitly correcting latent evolution, instead of improving step-wise prediction accuracy only, McCast produces more temporally coherent and reliable long-horizon forecasts. Experiments on two widely used benchmarks, SEVIR and MeteoNet, show that McCast achieves state-of-the-art performance, particularly in challenging long-horizon forecasting scenarios.

2605.13194 2026-05-14 cs.LG cs.AI

ECG-NAT: A Self-supervised Neighborhood Attention Transformer for Multi-lead Electrocardiogram Classification

Mahsa Gazeran, Sayvan Soleymanbaigi, Fatemeh Daneshfar, Amjad Seyedi, Fardin Akhlaghian Tab

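Affiliations: Department of Computer Engineering, University of Kurdistan; Department of Mathematics and Operational Research, University of Mons

AI summary: This paper proposes ECG-NAT, a self-supervised neighborhood attention transformer for multi-lead electrocardiogram (ECG) classification. Training has two stages: generative pretraining with a masked autoencoder on unlabeled data to learn robust cross-dataset representations, followed by discriminative fine-tuning with a dual loss that combines supervised contrastive and cross-entropy losses to improve classification. ECG-NAT uses a hierarchical attention mechanism to efficiently capture multi-scale temporal features, from fine-grained beat morphology to broader rhythm patterns, achieves strong classification accuracy with few labels, and suits real-time ECG diagnosis.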

Abstract

Electrocardiogram (ECG) arrhythmia classification remains challenging due to signal variability, noise, limited labeled data, and the difficulty in achieving both accuracy and efficiency in models. While self-supervised learning reduces label dependency, most methods target either global contextual features or local morphological patterns, but rarely implement hierarchical multi-scale feature extraction. ECG signals require architectures that simultaneously capture fine-grained beat-level morphology and broader rhythm-level dependencies with computational efficiency. To overcome this limitation, this paper proposes the Electrocardiogram Neighborhood Attention Transformer (ECG-NAT), a novel self-supervised learning approach tailored for multi-lead ECG classification. Our two-stage approach begins with generative pretraining, using a masked autoencoder to reconstruct partially masked ECG signals across multiple diverse datasets, enabling the model to learn robust, domain-invariant representations from unlabeled data. This is followed by discriminative fine-tuning with a dual-loss function that combines supervised contrastive and cross-entropy losses, aligning representation learning with label prediction. The hierarchical attention mechanism efficiently captures multi-scale temporal features from localized beat morphology to broader rhythm patterns at low computational cost. ECG-NAT achieves robust performance on benchmark datasets, with 88.1% accuracy using only 1% labeled data, demonstrating strong efficacy in low-resource settings. The framework combines superior classification performance with computational efficiency, making it practical for real-time ECG diagnosis. The code will be made available upon acceptance at: https://github.com/Mahsagazeran/ECG-NAT.
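
A dual loss that combines supervised contrastive and cross-entropy terms might look like the sketch below. It is written in plain Python for readability and rests on several assumptions that the abstract does not confirm: the two terms are combined as a weighted sum with a hypothetical `weight`, the contrastive term follows the common supervised contrastive formulation with temperature `tau`, and the toy embeddings and logits are invented.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean softmax cross-entropy, computed with the log-sum-exp trick."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)


def sup_con(embeddings: Sequence[Sequence[float]], labels: Sequence[int],
            tau: float = 0.1) -> float:
    """Supervised contrastive loss over L2-normalized embeddings."""
    normed: List[List[float]] = []
    for e in embeddings:
        norm = math.sqrt(sum(v * v for v in e))
        normed.append([v / norm for v in e])
    losses = []
    for i, zi in enumerate(normed):
        # Scaled cosine similarity of anchor i to every other sample.
        sims = {j: sum(a * b for a, b in zip(zi, zj)) / tau
                for j, zj in enumerate(normed) if j != i}
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-class partner are skipped
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def dual_loss(logits, embeddings, labels, weight: float = 0.5, tau: float = 0.1) -> float:
    """Weighted sum of the two terms (the weighting scheme is an assumption)."""
    return cross_entropy(logits, labels) + weight * sup_con(embeddings, labels, tau)


emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
logits = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.8], [0.0, 2.2]]
labels = [0, 0, 1, 1]
print(round(dual_loss(logits, emb, labels), 4))
```

The contrastive term pulls same-class embeddings together while the cross-entropy term trains the classifier head, which is one reading of "aligning representation learning with label prediction".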

2605.13192 2026-05-14 cs.RO

Dynamics Computation of Soft-Rigid Hybrid-Link System and Its Application to Motion Analysis of an Athlete Wearing Sport Prosthesis

Sunghee Kim, Yuta Shimane, Taiki Ishigaki, Ko Yamamoto

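Affiliations: Department of Mechano-Informatics, University of Tokyo; Research Institute for Science and Technology, Tokyo University of Science

AI summary: This paper proposes a motion analysis framework based on a soft-rigid hybrid-link system for analyzing the movements of athletes wearing sport-specific flexible prostheses. By modeling the interaction forces between the rigid human skeleton and the flexible prosthesis in a unified way, the method handles flexible components that conventional rigid-body multi-link models cannot. The study applies inverse kinematics of the hybrid-link system to motion reconstruction and estimates joint torques and ground reaction force via inverse dynamics; experiments show a ground reaction force estimation error of about 12%, and the analysis also accounts for muscle forces after amputation and their interaction with prosthesis deformation.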

Journal ref Advanced Robotics, Vol.40, No.4, 2026

Abstract

This paper presents a motion analysis framework for an athlete wearing a sport-specific flexible prosthesis, based on the soft-rigid hybrid-link system. Such motion analysis is challenging because we need to consider the interaction force between the rigid human skeleton system and a flexible prosthesis. However, most human musculoskeletal models are based on the computation framework of a rigid-body multi-link system. Recently, in the soft robotics research field, fast and efficient modeling methods have been developed for flexible rod deformation, which allow us to build a hybrid-link system that integrates rigid links and soft bodies in a unified formulation. We apply inverse kinematics of the hybrid-link system to motion reconstruction from motion capture data, and also present the estimation of the joint torques and ground reaction force by inverse dynamics. Through a human subject experiment, we show that the inverse dynamics achieved approximately 12% error on the ground reaction force estimation. Furthermore, we provide muscle force estimation that considers muscle amputation and the interaction force with the prosthetic leg deformation.

2605.13190 2026-05-14 cs.LG cs.AI

N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

Aleksander Lorenc, Frédéric Berdoz, Joël Mathys, Roger Wattenhofer

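Affiliations: ETH Zurich

AI summary: This paper proposes N-vium, a mixture-of-exits transformer designed to improve the inference efficiency of autoregressive transformers. By attaching prediction heads at different depths and using adaptive routing, the method partially parallelizes computation, increasing effective computation per second rather than simply reducing computation per token. Experiments show that N-vium runs up to 57.9% faster than a standard transformer at the same perplexity.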

Abstract

Improving the inference efficiency of autoregressive transformers typically means reducing FLOPs per token, usually through approximations that degrade model quality. We introduce N-vium, a mixture-of-exits transformer that partially parallelizes computation across depth on standard hardware, increasing effective FLOPs per second rather than minimizing compute per token. N-vium attaches prediction heads at multiple depths and defines the next-token distribution as a learned mixture over these exits, with token-adaptive routing. This formulation strictly generalizes the standard transformer, which is recovered exactly when routing assigns zero mass to all intermediate heads. Sampling from the mixture is exact, and complete KV caches are recovered by deferring the upper-layer computation and batching it with later tokens. We pretrain N-vium at scales up to 1.5B parameters. Our largest model reaches 57.9% wall-clock speedup over a parameter- and data-matched standard transformer at no perplexity cost.
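
The mixture formulation can be illustrated with a toy example: the next-token distribution is a routing-weighted sum of per-exit distributions, and one standard way to sample from a mixture exactly is to pick a component first and then sample from it. The per-exit probabilities, routing weights, and function names below are invented for illustration, and the deferred upper-layer computation and KV-cache handling described in the abstract are not modeled.

```python
import random
from typing import List, Sequence, Tuple


def mixture_distribution(exit_probs: Sequence[Sequence[float]],
                         routing: Sequence[float]) -> List[float]:
    """Next-token distribution p(x) = sum_k routing[k] * exit_probs[k][x]."""
    vocab = len(exit_probs[0])
    return [sum(w * p[x] for w, p in zip(routing, exit_probs)) for x in range(vocab)]


def sample_token(exit_probs: Sequence[Sequence[float]], routing: Sequence[float],
                 rng: random.Random) -> Tuple[int, int]:
    """Ancestral sampling: choose an exit, then a token from that exit's head."""
    k = rng.choices(range(len(routing)), weights=routing)[0]
    token = rng.choices(range(len(exit_probs[k])), weights=exit_probs[k])[0]
    return k, token


# Toy per-exit distributions over a 3-token vocabulary; the last row is the final head.
exits = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3],
         [0.2, 0.2, 0.6]]
routing = [0.5, 0.3, 0.2]   # hypothetical token-adaptive routing weights

mix = mixture_distribution(exits, routing)
print([round(p, 2) for p in mix])  # prints [0.42, 0.32, 0.26]

# Zero mass on all intermediate heads recovers the final head's distribution.
print(mixture_distribution(exits, [0.0, 0.0, 1.0]))

rng = random.Random(0)
draws = [sample_token(exits, routing, rng)[1] for _ in range(20000)]
freq = [draws.count(t) / len(draws) for t in range(3)]
```

Sampling this way draws tokens from the mixture itself, and whenever a shallow exit is chosen the token is known before the deeper layers have run, which is where a speedup could come from.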

2605.13182 2026-05-14 cs.CV

DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

Zheng Chen, Ruofan Yang, Jin Han, Dehua Song, Zichen Zou, Chunming He, Yong Guo, Yulun Zhang

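Affiliations: Shanghai Jiao Tong University; Huawei Noah’s Ark Lab; Duke University; Huawei Consumer Business Group

AI summary: DiffST is an efficient spatiotemporal-aware diffusion framework for real-world space-time video super-resolution (STVSR). By introducing cross-frame context aggregation and video representation guidance modules, it makes better use of spatiotemporal information, and its one-step sampling strategy speeds up inference. Experiments show that DiffST achieves leading performance on several real-world tasks while running about 17 times faster than existing methods.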

Comments Code is available at: https://github.com/zhengchen1999/DiffST

Abstract

Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17$\times$ faster than previous diffusion-based STVSR methods. Code is available at: https://github.com/zhengchen1999/DiffST.

2605.13181 2026-05-14 cs.LG cs.AI

Stable Attention Response for Reliable Precipitation Nowcasting

Penghui Wen, Zexin Hu, Sen Zhang, Patrick Filippi, Xiaogang Zhu, Allen Benter, Thomas Bishop, Zhiyong Wang, Kun Hu

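Affiliations: School of Computer Science, The University of Sydney; School of Life and Environmental Science, The University of Sydney; School of Computer Science and Information Technology, The University of Adelaide; Digital Agriculture, Orange Agricultural Institute; School of Science, Edith Cowan University

AI summary: Precipitation nowcasting is challenging because atmospheric dynamics are highly localized, rapidly changing, and heterogeneous. Although recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, they focus mainly on stronger representation learning and predictive capacity and overlook the stability of attention responses across samples. This paper proposes HARECast, a precipitation nowcasting framework based on head-wise attention-response energy regulation, which improves the stability and reliability of predictions by reducing cross-sample fluctuations in attention-response energy and achieves state-of-the-art performance on several benchmark datasets.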

Abstract

Precipitation nowcasting remains challenging due to the highly localized, rapidly evolving, and heterogeneous nature of atmospheric dynamics. Although recent methods increasingly adopt attention-based architectures in both unimodal and multimodal settings, they mainly emphasize stronger representation learning and prediction capacity, while paying less attention to the stability of attention responses across samples. In this work, we show that cross-sample instability of attention-response energy is an important and previously underexplored source of forecasting unreliability. Empirically, inaccurate forecasts are associated with larger attention-response energy variance across heads and layers. Theoretically, we show that cross-sample variability can propagate through self-attention, and enlarge a lower bound on prediction error. Based on this insight, we propose HARECast, a Head-wise Attention Response Energy-regulated framework for precipitation nowcasting. HARECast explicitly models head-wise attention-response energy and stabilizes it through a group-wise regularization objective that reduces cross-sample fluctuations. The proposed formulation is generic and applicable to both unimodal and multimodal nowcasting architectures. We instantiate HARECast in a standard forecasting pipeline with reconstruction branches and a diffusion-based predictor, and evaluate it on commonly used benchmarks--SEVIR and MeteoNet. Experimental results demonstrate that HARECast achieves state-of-the-art performance.

2605.13179 2026-05-14 cs.CV

Does Engram Do Memory Retrieval in Autoregressive Image Generation?

Jinghao Wang, Qiyuan He, Chunbin Gu, Pheng-Ann Heng

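Affiliations * The Chinese University of Hong Kong; National University of Singapore

AI summary This study examines the role of the Engram module in autoregressive image generation and finds that, although it reduces computation, it does not improve the quality of generated images. Experimental analysis indicates that the Engram module behaves more like a gated auxiliary pathway than a content-addressed recall mechanism. The study further shows that the module's benefit comes mainly from its structure rather than from the contents of the memory table.

Comments 9 pages

Details
Abstract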

The Engram module (a hash-keyed, O(1) associative memory injected into Transformer layers) was recently shown to improve large language model pretraining, with the appealing interpretation that it provides a content-addressed shortcut to recurring local token patterns. We ask whether this interpretation transfers to autoregressive (AR) image generation, or whether the observed gains, if any, come from a different mechanism. We adapt the Engram module to vision with 2D spatial $n$-gram hashing, gated fusion, and KV-cache-compatible incremental inference, and inject it into a class-conditional AR generator trained on ImageNet 256x256. Across a sweep of backbone-to-memory budget ratios $\rho \in [0.17, 0.90]$, every Engram-augmented variant trails the pure AR baseline in FID, indicating that the module saves backbone FLOPs but does not, by itself, improve sample quality. We then probe how the module is used. A gate-clamp sweep shows that disabling the Engram pathway entirely is catastrophic, yet a tiny constant gate ($g = 0.10$) matches or beats the learned gate, which is inconsistent with a heavily content-addressed recall mechanism. A donor-probe experiment shows that swapping the hash inputs for matched, adversarial, or random same-class exemplars produces statistically indistinguishable next-token distributions, while collapsing or randomising the table degrades them by two to three orders of magnitude. Finally, training a model from scratch with the entire memory table frozen to $\mathcal{N}(0, 1)$ noise costs only $\Delta\text{FID} = 0.10$ and actually raises Inception Score. Together, these findings indicate that the Engram in AR image generation behaves not as a content-addressed retriever but as a gated architectural side-pathway: a hash-keyed residual stream whose benefit is dominated by the pathway itself, with the learned table contributing only a small distributional refinement.

2605.13171 2026-05-14 cs.AI

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Moritz Firsching, Paul Lezeau, Salvatore Mercuri, Miklós Z. Horváth, Yaël Dillies, Calle Sönne, Eric Wieser, Fred Zhang, Thomas Hubert, Blaise Agüera y Arcas, Pushmeet Kohli

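Affiliations * Google DeepMind; Imperial College London; Stockholms universitet

AI summary As automated reasoning systems advance, high-quality mathematical problems are needed to evaluate their capabilities. To this end, the authors present Formal Conjectures, a continually evolving benchmark of 2615 problems formalized in Lean 4, including 836 solved problems and 1029 open mathematical conjectures, for evaluating automated proof discovery. The benchmark relies on a collaborative open-source project to ensure the correctness of the formalizations, uses AI-generated proofs and disproofs for ongoing refinement, and has already led to new mathematical discoveries.

Comments 21 pages, 4 figures, 5 tables

Details
Abstract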

As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.

2605.13170 2026-05-14 cs.LG cs.MA

Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications

Maxwell Standen, Junae Kim, Claudia Szabo

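Affiliations * The University of Adelaide

AI summary This paper studies adversarial attacks on multi-agent reinforcement learning systems, focusing on how perturbing communicated messages can degrade system performance. The authors use gradient information from the Jacobian to identify the messages, agents, and timesteps most vulnerable to attack, and design two new adversarial loss functions that balance attack success against attack impact. Experiments across multiple environments show that the message selection method matches or exceeds random selection in almost all tested scenarios.

Comments Full version of the Extended Abstract presented at AAMAS 2026

Details
Abstract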

Multi-agent systems rely on communication for information sharing and action coordination, which exposes a vulnerability to attacks. We investigate single-victim communication perturbation attacks against Multi-Agent Reinforcement Learning-trained systems and propose methods that use gradient information from the Jacobian to identify which messages, agents, and timesteps are most susceptible to attack and have the greatest impact on the system. We enhance these methods with two proposed adversarial loss functions that trade off attack success for attack impact and also create more effective perturbations. We empirically demonstrate the effectiveness of our methods against two different multi-agent communication methods in navigation, PredatorPrey, and TrafficJunction environments. Our results show that our novel message selection method achieves a similar or greater impact than random message selection across almost all tested scenarios. Our victim selection, message selection, tempo, and loss functions improve attack effectiveness in half of the thirty scenarios we tested.

2605.13167 2026-05-14 cs.CL

GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

Jinwoong Kim, Rui Yang, Huishuai Zhang

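Affiliations * Peking University; Wangxuan Institute of Computer Technology

AI summary This paper introduces GeoBuildBench, a benchmark for evaluating whether large language models and multimodal agents can turn informal natural-language plane geometry problems into executable geometric constructions. Unlike earlier geometry test sets that focus on answer correctness or static diagram interpretation, the benchmark treats diagram generation as an interactive construction task in which the model must generate a domain-specific language program that satisfies explicitly specified geometric objects and verifiable constraints. The study finds that, although existing models achieve some success, they still frequently exhibit structural hallucinations, missing objects, and unmet geometric constraints, indicating that geometry construction is a rigorous testbed for executable reasoning.

Details
Abstract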

We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram generation as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.

2605.13165 2026-05-14 cs.CL

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Chenjun Xu, Zhennan Zhou, Zhan Su, Bill Howe, Lucy Lu Wang, Bingbing Wen

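Affiliations * University of Washington; University of Montreal

AI summary This paper proposes STOP, a structured on-policy method for efficiently pruning long chain-of-thought reasoning when data is limited. The method generates reasoning traces through self-distillation, maps them into a structured reasoning interface, and applies the Earliest Correct Node (ECN) strategy to remove redundant reasoning steps, substantially reducing the number of generated tokens while largely preserving accuracy. Experiments on several mathematical reasoning tasks show that STOP improves reasoning efficiency, reduces distributional shift, and yields more efficient reasoning structure.

Comments 20 pages, 6 figures, 6 tables. Code available at: https://github.com/chenjux/ECN-STOP

Details
Abstract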

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
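
The ECN rule described in the abstract can be illustrated with a minimal sketch. It assumes a trace has already been segmented into a flat list of nodes (the paper builds a reasoning tree, which is not modeled here). The `Node` schema, the `is_conclusion` flag, and the exact-match answer check are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One segmented reasoning step (hypothetical schema, not the paper's)."""
    text: str
    is_conclusion: bool           # assumed taxonomy label: node states an answer
    answer: Optional[str] = None  # answer extracted from the node, if any


def ecn_prune(trace: List[Node], gold: str) -> List[Node]:
    """Keep the shortest prefix ending at the earliest correct conclusion node.

    If no node is both a conclusion and correct, return the trace unchanged.
    """
    for i, node in enumerate(trace):
        if node.is_conclusion and node.answer == gold:
            return trace[: i + 1]
    return list(trace)


trace = [
    Node("Let x be the number of apples.", False),
    Node("So x = 12.", True, "12"),
    Node("Wait, let me double-check by substitution.", False),
    Node("Yes, the answer is 12.", True, "12"),
]
pruned = ecn_prune(trace, gold="12")
print(len(trace), "->", len(pruned))  # prints "4 -> 2"
```

Here the post-solution verification steps are dropped because the second node already concludes with the correct answer.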

2605.13162 2026-05-14 cs.LG

Continual Fine-Tuning of Large Language Models via Program Memory

Hung Le, Svetha Venkatesh

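Affiliations * Deakin Applied AI Initiative

AI summary This paper studies how to fine-tune large language models efficiently in continual learning settings and proposes ProCL, a continual LoRA framework based on program memory. Inspired by Complementary Learning Systems in neuroscience, the method uses structured program memory slots and input-conditioned attention to balance rapid adaptation with knowledge retention. Experiments show that ProCL achieves better knowledge retention and less catastrophic forgetting on multiple benchmarks.

Comments 18 pages, preprint

Details
Abstract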

Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), has become a standard approach for adapting Large Language Models (LLMs) under limited compute. However, in continual settings where models are updated sequentially with small datasets, conventional LoRA updates struggle to balance rapid adaptation and knowledge retention. Existing methods typically treat the low-rank space as a homogeneous update region, lacking mechanisms to regulate how short-term updates are consolidated over time. We propose a continual LoRA framework with Program memory, inspired by Complementary Learning Systems in neuroscience. Our approach, dubbed ProCL, organizes LoRA adapters into structured program memory slots that are dynamically retrieved through input-conditioned attention. This enables rapid and localized adaptation, encouraging similar inputs to reuse shared adapter regions while reserving unused capacity for future data. The slots are then combined with the underlying adapter, which maintains a distributed representation that gradually accumulates knowledge across tasks to balance plasticity and stability. Our method operates entirely within the LoRA parameterization and incurs no additional inference cost. Experiments on diverse benchmarks demonstrate improved retention and reduced catastrophic forgetting over other continual LoRA strategies.

2605.13158 2026-05-14 cs.CV

Unifying Physically-Informed Weather Priors in A Single Model for Image Restoration Across Multiple Adverse Weather Conditions

Jiaqi Xu, Xiaowei Hu, Lei Zhu, Pheng-Ann Heng

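Affiliations * Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; ROAS Thrust, the Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China and The Hong Kong University of Science and Technology, Department of Electronic and Computer Engineering, Hong Kong SAR, China

AI summary This paper addresses image restoration under multiple adverse weather conditions and proposes a unified, physically informed weather-prior model that handles degradations caused by different weather phenomena such as falling drops and fog. Based on an analysis of weather-related visual factors, the method builds an imaging model that combines individually visible particles with fog-like aggregate scattering, and designs a weather-prior-based network that enhances features using estimated occlusion and transmission to recover a clear scene. Experiments show that the method outperforms existing state-of-the-art methods across multiple adverse weather scenarios.

Comments Accepted by TCSVT

Details
Abstract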

Image restoration under multiple adverse weather conditions aims to develop a single model to recover the underlying scene with high visibility. Weather-related artifacts vary with the particle's distance to the camera according to the established scene visibility analysis, where close and faraway regions are more affected by falling drops and fog effects, respectively. Existing methods fail to consider this weather-specific physical visual process, which limits their restoration performance. In this work, we analyze the common visual factors in adverse weather conditions and present a unified imaging model that considers the individually visible particles and fog-like aggregate scattering effects. Further, we design a novel weather-prior-based network, which leverages the weather-related prior information to help recover the scene by enhancing the features using the estimated occlusion and transmission. Experimental results in multiple adverse scenarios show the superiority of our method over state-of-the-art methods.

2605.13156 2026-05-14 cs.CV

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

Jiaxin Liu, Ding Zhong, Yue Wang, Zhidong Yang, Zhaolu Kang, Guangyuan Dong, Qishi Zhan, Pengcheng Fang, Aofan Liu

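Affiliations * UIUC; UMich; Stanford; HKUST; PKU; NUS; Marquette; Southampton

AI summary Vision-language models (VLMs) perform well on cross-modal understanding tasks but often suffer from object hallucination, describing content that is absent from the input image, which undermines their reliability and interpretability. This paper proposes a dual-pathway circuit analysis framework for identifying and analyzing hallucination-related circuits in VLMs. Using activation patching and Conditional Pathway Analysis, the study identifies a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs, and reveals how the two interact. Experiments show that suppressing hallucination-pathway components substantially reduces object hallucination, and that the circuit is consistent across model architectures and transfers selectively across hallucination types.

Details
Abstract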

Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.

2605.13155 2026-05-14 cs.CV

Pareto-Guided Optimal Transport for Multi-Reward Alignment

Ying Ba, Tianyu Zhang, Mohan Zhou, Yalong Bai, Wenyi Mo, Guiwei Zhang, Bing Su, Ji-Rong Wen

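Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models; Engineering Research Center of Next-Generation Intelligent Search; Rutgers University

AI summary Text-to-image generation models have made notable progress in preference optimization, but robust alignment across diverse reward models remains a major challenge. This paper proposes a Pareto Frontier-Guided Optimal Transport (PG-OT) framework that constructs a prompt-specific Pareto frontier and uses distribution-aware optimal transport to map dominated samples toward it, mitigating reward hacking. The authors also introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as metrics for multi-reward synergy and reward-hacking risk, and experiments show that the method outperforms existing methods on multiple metrics.

Comments Accepted to ICML 2026

Details
Abstract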

Text-to-image generation models have achieved remarkable progress in preference optimization, yet achieving robust alignment across diverse reward models remains a significant challenge. Existing multi-reward fusion approaches rely on weighted summation, which is costly to tune and insufficient for balancing conflicting objectives. More critically, optimization with reward models is highly susceptible to reward hacking, where reward scores increase while the perceived quality of generated images deteriorates. We demonstrate that optimizing against a unified global target under heterogeneous reward upper bounds can induce reward hacking, a risk further exacerbated by the inherent instability of weak reward models. To mitigate this, we propose a Pareto Frontier-Guided Optimal Transport (PG-OT) framework. Our method constructs a prompt-specific Pareto frontier and maps dominated samples toward it via distribution-aware optimal transport. Furthermore, we develop both online and offline optimization strategies tailored to diverse reward signal characteristics. To provide a more rigorous assessment, we introduce the Joint Domination Rate (JDR) and Joint Collapse Rate (JCR) as principled metrics to quantify multi-reward synergy and reward hacking. Experimental results show that our approach outperforms strong baselines with an 11% gain in JDR and achieves a near 80% win rate in human evaluations.
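
To make the "prompt-specific Pareto frontier" concrete, here is a minimal sketch of splitting one prompt's samples into frontier and dominated sets from their reward vectors. It covers only the standard dominance test. The optimal-transport mapping toward the frontier, which is the paper's actual contribution, is not reproduced, and the function names and toy reward values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a >= b on every reward and a > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_split(rewards: List[Sequence[float]]) -> Tuple[List[int], List[int]]:
    """Return (frontier_indices, dominated_indices) for one prompt's samples."""
    frontier, dominated = [], []
    for i, r in enumerate(rewards):
        beaten = any(dominates(o, r) for j, o in enumerate(rewards) if j != i)
        (dominated if beaten else frontier).append(i)
    return frontier, dominated


# Toy reward vectors (two hypothetical reward models) for five samples.
rewards = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.8), (0.9, 0.1)]
frontier, dominated = pareto_split(rewards)
print(frontier, dominated)  # prints "[0, 1, 3] [2, 4]"
```

Samples 2 and 4 are each beaten on all rewards by another sample, so in a PG-OT-style pipeline they would be the ones moved toward the frontier.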

2605.13153 2026-05-14 cs.AI

Strikingness-Aware Evaluation for Temporal Knowledge Graph Reasoning

Rikui Huang, Shengzhe Zhang, Wei Wei

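Affiliations * School of Computer Science & Technology, Huazhong University of Science and Technology; Institute of Artificial Intelligence, Huazhong University of Science and Technology; School of Artificial Intelligence & Automation, Huazhong University of Science and Technology

AI summary This paper proposes an improved evaluation method for temporal knowledge graph reasoning (TKGR), arguing that current evaluation weights all events equally and ignores the fact that most events are repetitions, which overestimates models' reasoning ability. The authors propose a strikingness-aware evaluation framework that uses a rule-based strikingness measure to distinguish and emphasize rare events that demand deeper reasoning. Experiments show that the framework evaluates models' ability to predict outstanding events more rigorously, offering a new evaluation perspective for TKGR research.

Comments Accepted to IJCAI-ECAI 2026

Details
Abstract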

Temporal Knowledge Graph Reasoning (TKGR) aims at inferring missing (especially future) events from historical data. Current evaluation in TKGR uniformly weights all events, ignoring that most are trivial repetitions, which overestimates the true reasoning ability. Therefore, the rare outstanding events, whose prediction demands deeper reasoning, should be distinguished and emphasized. To this end, we propose a strikingness-aware evaluation framework, which introduces a rule-based strikingness measuring framework (RSMF) to quantify event strikingness by comparing its expected occurrence with peer events derived from temporal rules. Strikingness is then integrated as a weighting factor into metrics like weighted MRR and Hits@k. Experiments on four TKG benchmarks reveal that: 1) all representative models perform worse as event strikingness increases; 2) path-based methods excel on low-strikingness events and representation-based ones on high-strikingness events; and 3) the gains of an ensemble method we design stem from fitting trivial events rather than from improved reasoning. Our framework provides a more rigorous evaluation, refocusing the field on predicting outstanding events.
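
As an illustration of using strikingness as a weighting factor, the sketch below computes weighted MRR and weighted Hits@k from per-event ranks and weights. The abstract does not give the exact formulas, so normalizing by the sum of weights is an assumption, and the ranks and weights are invented toy values rather than RSMF outputs.

```python
from typing import Sequence


def weighted_mrr(ranks: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted mean reciprocal rank; normalization by sum(weights) is assumed."""
    return sum(w / r for r, w in zip(ranks, weights)) / sum(weights)


def weighted_hits_at_k(ranks: Sequence[int], weights: Sequence[float], k: int) -> float:
    """Weighted fraction of events whose true answer is ranked within the top k."""
    return sum(w for r, w in zip(ranks, weights) if r <= k) / sum(weights)


# Two trivial events ranked well, one striking event ranked poorly (toy data).
ranks = [1, 2, 10]
uniform = [1.0, 1.0, 1.0]
striking = [0.1, 0.1, 0.8]  # hypothetical strikingness weights

print(round(weighted_mrr(ranks, uniform), 3), round(weighted_mrr(ranks, striking), 3))
print(round(weighted_hits_at_k(ranks, uniform, 3), 3),
      round(weighted_hits_at_k(ranks, striking, 3), 3))
```

With uniform weights the two easy events dominate the score. Once the striking event carries most of the weight, both metrics drop, which is the overestimation effect the abstract describes.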

2605.13152 2026-05-14 cs.CV cs.AI cs.LG cs.RO

EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision

Jiahao Chen, Zihui Zhang, Yafei Yang, Jinxi Li, Shenxing Wei, Zhixuan Sun, Bo Yang

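Affiliations * Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University

AI summary This paper proposes EvObj, an unsupervised 3D instance segmentation method that addresses the geometric domain gap between synthetic data and real point-cloud scenes. By introducing an object discerning module and an object completion module, the method dynamically refines object priors and reconstructs partial geometries, improving segmentation in real scenes. Experiments show that EvObj outperforms existing methods on multiple datasets and achieves state-of-the-art results.

Comments CVPR 2026. Code and data are available at: https://github.com/vLAR-group/EvObj

Details
Abstract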

We introduce EvObj for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real-world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two innovative modules: (1) An object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) An object completion module that reconstructs partial geometries after discovering objects. We conduct extensive experiments on both real-world and synthetic datasets, demonstrating superior 3D object segmentation performance over all baselines while achieving state-of-the-art results.

2605.13151 2026-05-14 cs.CV

GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

Jiyong Rao, Yu Wang, Shengjie Zhao

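Affiliations * School of Computer Science and Technology, Tongji University

AI summary GenCape is a generative framework for category-agnostic pose estimation (CAPE), which aims to localize keypoints in images of arbitrary categories using only a few annotated support examples. The method infers relationships among keypoints automatically from image-based support inputs, without additional textual descriptions or predefined skeletons, overcoming earlier methods' reliance on manual annotation and their limited structural flexibility. GenCape comprises an iterative Structure-aware Variational Autoencoder and a Compositional Graph Transfer module that capture instance-level structural information and promote semantic alignment across categories. Experiments show that, in few-shot settings, it outperforms graph-support methods and remains competitive with text-support methods.

Comments Accepted at ICLR 2026

Details
Abstract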

Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a Generative-based framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedded layer-wise into the Graph Transformer Decoder for progressive structural priors refinement. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1- and 5-shot settings, while maintaining competitive performance against text-support counterparts.

2605.13149 2026-05-14 cs.CL cs.AI cs.LG

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür

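Affiliations * University of Illinois Urbana-Champaign; Amazon

AI summary This paper proposes AcquisitionSynthesis, which uses acquisition functions from active learning as reward models to train language models to generate high-quality synthetic data, addressing the data-quality bottleneck in model training. By quantifying the impact of generated data on the downstream learner, the method makes data generation more targeted and effective. Experiments show that data generated with AcquisitionSynthesis improves student-model performance and robustness, and that the method can also generate data for other models and for low-to-high resource training paradigms.

Details
Abstract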

Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top-quality samples. Some works rely on rejection sampling: generating many synthetic samples and filtering out low-quality ones. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.

2605.13148 2026-05-14 cs.LG cs.CV

Understanding Generalization through Decision Pattern Shift

Huiqi Deng, Yibo Li, Quanshi Zhang, Peng Zhang, Hongbin Pei, Xia Hu

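Affiliations * Xi’an Jiaotong University; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary This paper studies why deep neural networks fail to generalize to unseen samples and proposes a new analytical perspective, Decision Pattern Shift (DPS). The method analyzes the stability of a model's internal decision patterns and quantifies their deviation between training and test in order to measure generalization. The study finds that decision patterns are highly structured and class-consistent, and that the magnitude of their shift correlates strongly and linearly with the generalization gap, providing a unified explanation of different generalization failure scenarios.

Comments 14 pages, 12 figures, computer vision and pattern recognition

Details
Abstract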

Understanding why deep neural networks (DNNs) fail to generalize to unseen samples remains a long-standing challenge. Existing studies mainly examine changes in externally observable factors such as data, representations, or outputs, yet offer limited insight into how a model's internal decision mechanism evolves from training to test. To address this gap, we introduce Decision Pattern Shift (DPS), a new perspective that defines generalization through the stability of internal decision patterns and quantifies failure as their deviation from those learned during training. Specifically, we represent each sample's decision pattern as a GradCAM-based channel-contribution vector, which captures how feature channels collectively support a prediction, and we propose the DPS metric to measure its discrepancy from the class-average pattern. Empirical analyses across multiple datasets and architectures show that, (i) decision patterns form a highly structured, class-consistent space with strong intra-class cohesion and low inter-class confusion, enabling direct analysis of a model's decision logic; (ii) the DPS magnitude correlates linearly with the generalization gap (nearly all Pearson r > 0.8), revealing generalization as a systematic drift in the model's internal decision mechanism; (iii) the DPS spectrum organizes diverse generalization degradation scenarios (covering ideal generalization, in-distribution degradation, domain shift, out-of-distribution, and shortcut learning) into a continuous trajectory, providing a unified explanation of their failure modes. These findings open up new possibilities for early generalization-risk detection, failure-mode diagnosis, and channel-level defect localization.

2605.13145 2026-05-14 cs.LG

Collaborating in Multi-Armed Bandits with Strategic Agents

Idan Barnea, Ofir Schlisselberg, Yishay Mansour

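Affiliations * Tel Aviv University; Google Research

AI summary This paper studies collaborative learning in multi-agent Bayesian bandit problems, in which strategic agents jointly solve the same bandit instance. Unlike prior work that assumes short-lived agents, it considers persistent agents and proposes CAOS, a mechanism that sustains collaboration as a Nash equilibrium while guaranteeing strong regret bounds. The results show that effective collaborative exploration can be sustained through information sharing alone, with performance close to that of fully cooperative systems.

Details
Abstract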

We study collaborative learning in multi-agent Bayesian bandit problems, where strategic agents collectively solve the same bandit instance. While multiple agents can accelerate learning by sharing information, strategic agents might prefer to free-ride and avoid exploration. We consider a setting with persistent agents that participate in multiple time periods. This is in contrast to most previous works on incentives in multi-agent MAB, which assume short-lived agents, namely each agent has a single decision to make and optimizes their expected reward in that single decision. As in the multi-agent MAB model with incentives, our model does not have monetary transfers, and the only incentives are through information sharing. We propose CAOS, a mechanism that sustains collaboration as a Nash equilibrium while achieving strong regret guarantees. Our results demonstrate that collaborative exploration can be sustained purely through information sharing, achieving performance close to that of fully cooperative systems despite strategic behavior.