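arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.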

2605.31096 2026-06-01 cs.CV

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

Affiliations: Visual AI Lab, The University of Hong Kong; Independent Researcher; University of Science and Technology of China

AI summary: Proposes the iVGR framework, which uses reinforcement learning and a dual-stream training strategy to internalize visual localization into textual reasoning, avoiding the interference caused by explicit visual grounding at inference time and improving fine-grained perception.

Comments: Accepted by ICML 2026

Abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

2605.31094 2026-06-01 cs.CV cs.AI

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster, Mehdi Astaraki, Paula Tamara Buzduga, Kerstin Ritter, Benedikt Wiestler, Jan Kirschke, Jonathan Shapey, Tom Vercauteren, Florian Kofler

Affiliations: Hertie Institute for AI in Brain Health, University of Tübingen, Germany; King’s College London, UK; Technical University of Munich, Germany; Stockholm University, Sweden

AI summary: Recasts segment matching in panoptic segmentation as a constrained bipartite assignment problem, defines four matching strategies, extends them to part-aware evaluation, and releases a unified open-source package built on Panoptica.

Comments: 9 pages, 4 figures

Abstract

The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.
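
For reference, PQ under the original One-to-One rule can be sketched in a few lines of Python. The sketch assumes the widely used definition PQ = (sum of IoU over true positives) / (TP + 0.5 FP + 0.5 FN) and non-overlapping segments on each side. The function name `panoptic_quality` and the sparse IoU dictionary are hypothetical and are not taken from Panoptica.

```python
from typing import Dict, Tuple


def panoptic_quality(
    iou: Dict[Tuple[int, int], float],
    n_gt: int,
    n_pred: int,
    threshold: float = 0.5,
) -> Tuple[float, int, int, int]:
    """Return (PQ, TP, FP, FN) under One-to-One matching.

    iou maps (ground_truth_index, prediction_index) to the IoU of that pair;
    pairs that do not overlap can simply be left out.
    """
    used_gt, used_pred = set(), set()
    iou_sum = 0.0
    # Visit candidate pairs from highest to lowest IoU. With non-overlapping
    # segments and a threshold of 0.5 or more, each segment has at most one
    # partner, so the used-sets act only as a guard.
    for (g, p), value in sorted(iou.items(), key=lambda kv: -kv[1]):
        if value > threshold and g not in used_gt and p not in used_pred:
            used_gt.add(g)
            used_pred.add(p)
            iou_sum += value
    tp = len(used_gt)
    fn = n_gt - tp    # ground-truth segments left unmatched
    fp = n_pred - tp  # predicted segments left unmatched
    denom = tp + 0.5 * fp + 0.5 * fn
    pq = iou_sum / denom if denom > 0 else 0.0
    return pq, tp, fp, fn


# Three ground-truth segments, four predictions, two pairs above the threshold.
pq, tp, fp, fn = panoptic_quality({(0, 0): 0.8, (1, 1): 0.6, (2, 2): 0.3}, 3, 4)
```

Note that FP and FN are counted per segment rather than per matching edge. Lowering `threshold` below 0.5 removes the uniqueness guarantee, which is where the alternative strategies named in the abstract become relevant; this greedy sketch does not implement them.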

2605.31093 2026-06-01 cs.CV

Cross-Modal Clinical Knowledge Integration for Mammography Report Generation

Jiayi Zhu, Fuxiang Huang, Yu Xie, Xi Wang, Zhixuan Chen, Yuan Guo, Qingcong Kong, Zhenhui Li, Qiong Luo, Hao Chen

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Lingnan University; The Third Affiliated Hospital of Kunming Medical University, Yunnan Cancer Hospital, Peking University Cancer Hospital Yunnan; Guangzhou First People's Hospital, South China University of Technology; The Third Affiliated Hospital, Sun Yat-Sen University; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute

AI summary: Proposes the MammoRG framework, which simulates the clinical reporting workflow through two-stage training and integrates the BI-RADS guideline and prior knowledge to improve the clinical consistency of generated reports.

Comments: 16 pages, 5 figures

Abstract

Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.

2605.31090 2026-06-01 cs.CV cs.AI

On Revisiting Entropy for Identifying Mislabeled Images

Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

Affiliations: MedAI Technology (Wuxi) Co. Ltd., Wuxi, China; Sichuan University, Chengdu, China; University of Basel, Allschwil, Switzerland; Technical University of Munich, Munich, Germany

AI summary: Proposes the signed entropy integral (SEI), a statistic based on training dynamics that captures both the magnitude and the temporal trend of prediction entropy to identify mislabeled training samples, reaching state-of-the-art performance on medical imaging datasets.

Comments: ICML 2026

Abstract

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets spanning diverse modalities and pathologies (a domain particularly susceptible to labeling errors due to diagnostic complexity), we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.
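
The key observation can be made concrete with a small sketch that tracks the Shannon entropy of one sample's predicted class distribution across epochs and summarizes it by its mean level and its least-squares slope. This is not the SEI formula, which the abstract does not give; the helper names `prediction_entropy` and `magnitude_and_trend` and the toy probabilities are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def magnitude_and_trend(per_epoch_probs: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean entropy, least-squares slope of entropy over epochs)."""
    entropies: List[float] = [prediction_entropy(p) for p in per_epoch_probs]
    n = len(entropies)
    mean_h = sum(entropies) / n
    mean_t = (n - 1) / 2
    num = sum((t - mean_t) * (h - mean_h) for t, h in enumerate(entropies))
    den = sum((t - mean_t) ** 2 for t in range(n))
    slope = num / den if den > 0 else 0.0
    return mean_h, slope


# Toy softmax outputs for one sample over three epochs (made-up numbers).
clean = [[0.5, 0.5], [0.8, 0.2], [0.99, 0.01]]  # entropy falls steadily
noisy = [[0.5, 0.5], [0.55, 0.45], [0.5, 0.5]]  # entropy stays high
clean_mean, clean_slope = magnitude_and_trend(clean)
noisy_mean, noisy_slope = magnitude_and_trend(noisy)
```

In this toy case the cleanly labeled sample has a lower mean entropy and a negative slope, while the noisy one stays near the maximum with no downward trend. These are the two signals, magnitude and temporal trend, that the abstract says SEI combines.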

2605.31082 2026-06-01 cs.SD cs.MM

Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation

Nelly Garcia, Joshua Reiss

Affiliations: Centre for Digital Music (C4DM), Queen Mary University of London

AI summary: Compares the believability of procedurally generated synthetic sound effects with real recordings in live-action and animated scenes, finding that synthetic sounds work well in drama and sci-fi scenes but are less believable for everyday actions in cartoons.

Comments: ArtsIT, Interactivity and Game Creation 2024

Abstract

Creating sound for storytelling is crucial to establishing the environment in productions such as films, TV series and video games. This process often involves repeating, layering and recording real objects or using sound libraries, which can be time-consuming and repetitive. To address these challenges, procedural audio, also known as digital foley, offers a solution by allowing sound designers to quickly generate samples. Despite its efficiency, questions remain about the believability of synthetic samples compared to real ones. In our study, we compared synthetic samples generated by an online procedural engine and integrated them with both animated and live-action visuals. Our results indicate that procedural audio is highly effective and perceived as believable in drama and sci-fi scenes, particularly for sound models such as lasers, hits, air and rockets, whereas synthetic sounds weren't as believable in cartoon productions when representing everyday actions. Finally, we identified specific models that needed optimisation and highlighted audio features that needed improvement with feedback from audio professionals.

2605.31075 2026-06-01 cs.CV

Task-Focused Memorization for Multimodal Agents

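Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li

Affiliations: Fudan University

AI summary: Proposes TaskMem, a reinforcement-learning-based framework for task-focused memorization policy learning; two-phase training lets multimodal agents dynamically select task-relevant memories from streaming observations, improving VQA accuracy by 5.3% to 7.0% on three streaming benchmarks.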

Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

2605.31073 2026-06-01 cs.CL

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue

Affiliations: Alibaba Group; Zhejiang University; Huzhou Normal University; Zhejiang Key Laboratory of Intelligent Education Technology and Application

AI summary: Proposes the ConsisGuard framework, which uses Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment to address the inconsistency between deliberation and enforcement in reasoning-based LLM guardrails, improving safety detection performance and reducing policy execution failures.

Comments: 18 pages, 9 figures

Abstract

Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.

2605.31070 2026-06-01 cs.LG cs.GT

Learning to Bid in FCR Markets: A Best-of-Both-Worlds Approach

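Marius Potfer, Cheng Wan, Pierre Gruet

Affiliations: EDF Lab Paris-Saclay, FiME (Laboratoire de Finance des Marchés de l’Énergie)

AI summary: In the European Frequency Containment Reserve (FCR) market, bidders observe only partial feedback such as the clearing price and awarded quantity. The paper recasts the multi-country FCR clearing problem as a repeated multi-unit uniform-price auction and adapts a Best-of-Both-Worlds combinatorial semi-bandit algorithm that achieves logarithmic pseudo-regret in stochastic environments and square-root regret in adversarial ones; experiments confirm the theoretical scaling and competitive performance in practice.

Comments: Algorithms and data available at https://data.mendeley.com/datasets/htprbf47dg/1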

Abstract

Bidding in the European Frequency Containment Reserve (FCR) market is challenging for flexibility providers because competing offers are hidden and bidders observe only partial feedback from the market, such as the clearing price and awarded quantity. For a participant active in a single country, we show that the multi-country FCR clearing problem can be recast as a repeated multi-unit uniform-price auction against an endogenous vector of opposing bids. This reformulation yields an online learning problem and allows us to adapt a Best-of-Both-Worlds combinatorial semi-bandit algorithm implementable from this standard market feedback. The resulting bidder achieves logarithmic pseudo-regret in stochastic environments and $\mathcal{O}(\sqrt{T})$ regret in adversarial ones. Synthetic experiments confirm the expected scaling, and backtests on historical European FCR data show competitive performance in practice: the method performs especially well on stable products, while EXP3-type baselines can be safer under stronger non-stationarity. Overall, the results show that learning-based bidding in FCR markets is theoretically grounded and practically useful when the learning rule matches product-level market stability.
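
To illustrate the kind of feedback a bidder sees, the following sketch clears a single-zone, multi-unit uniform-price procurement auction. It is a simplified assumption, not the multi-country FCR clearing rule from the paper: offers are unit-sized, the lowest prices are accepted first, the marginal accepted offer sets the price, and ties are broken in the bidder's favor. The function name `clear_uniform_price` is hypothetical.

```python
from typing import List, Optional, Tuple


def clear_uniform_price(
    own_bids: List[float], rival_bids: List[float], demand: int
) -> Tuple[Optional[float], int]:
    """Return (clearing_price, units_awarded_to_us) for unit-sized offers.

    Assumes demand is a non-negative number of units to procure.
    """
    # Tag each offer with its owner: 0 = us, 1 = rivals. Sorting by
    # (price, owner) accepts cheaper offers first and breaks ties in our favor.
    offers = sorted([(p, 0) for p in own_bids] + [(p, 1) for p in rival_bids])
    accepted = offers[:demand]
    if not accepted:
        return None, 0
    clearing_price = accepted[-1][0]  # price of the marginal accepted offer
    awarded = sum(1 for _, owner in accepted if owner == 0)
    return clearing_price, awarded


# The bidder never sees rival_bids, only the two returned values.
price, awarded = clear_uniform_price([10.0, 30.0], [5.0, 20.0, 25.0, 40.0], demand=3)
```

Here the accepted offers are 5, 10 and 20, so the bidder learns only that the price cleared at 20 and that one of its two units was awarded. A learning algorithm of the kind described in the abstract has to choose the next bid vector from this partial feedback alone.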

2605.31069 2026-06-01 cs.CV cs.CL

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

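Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

Affiliations: University of Science and Technology of China

AI summary: Proposes the VISTA framework, which performs long-video event prediction through multi-level event semantics mining (detail level, event level, and future level), addressing the inability of existing models to precisely extract event details and perform fine-grained analysis.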

DOI: 10.1007/978-981-95-6950-2

Abstract

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

2605.31068 2026-06-01 cs.CV

HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning

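Md Aminur Hossain, Ayush V. Patel, Sanjay K. Singh, Biplab Banerjee

Affiliations: Space Applications Centre, Indian Space Research Organisation; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay

AI summary: Proposes HQ-JEPA, a hybrid quantum-classical architecture that learns semantic representations from Sentinel-1/2 imagery through joint-embedding prediction, cross-modal alignment, SIGReg Gaussian regularization, and a quantum fidelity loss, achieving competitive and often superior performance over strong baselines on GeoBench classification and segmentation tasks.

Comments: 19 pages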

Abstract

We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA combines four complementary objectives: latent token prediction, cross-modal token alignment, SIGReg-based Gaussian regularization in the fused latent space, and a differentiable SWAP-test-based Fidelity Quantum Similarity (FQS) loss. Unlike pixel reconstruction methods, HQ-JEPA learns semantic representations directly in latent space and uses quantum state-overlap-based similarity as an additional regularization signal. We evaluate the pretrained encoder on GeoBench classification and segmentation tasks under linear probing and fine-tuning settings. Results show that HQ-JEPA achieves competitive and often superior performance over strong self-supervised and remote sensing foundation-model baselines, demonstrating the benefit of integrating predictive self-supervision, cross-modal geometric regularization, and quantum fidelity-based representation learning for remote sensing applications.
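
As background on the quantity a SWAP test estimates: for two pure states, the ancilla is measured as 0 with probability 1/2 + 1/2 |<psi|phi>|^2, so the squared overlap can be recovered from that probability. The sketch below simulates this statistic classically in plain Python. How HQ-JEPA encodes embeddings as quantum states and the exact form of its FQS loss are not stated in the abstract, so `fidelity_loss` is only an illustrative "1 - fidelity" stand-in, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[complex]) -> List[complex]:
    """Scale a nonzero amplitude vector to unit norm."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in v))
    return [complex(a) / norm for a in v]


def squared_overlap(psi: Sequence[complex], phi: Sequence[complex]) -> float:
    """|<psi|phi>|^2 for two amplitude vectors of equal length."""
    inner = sum(complex(a).conjugate() * complex(b) for a, b in zip(psi, phi))
    return abs(inner) ** 2


def swap_test_p0(psi: Sequence[complex], phi: Sequence[complex]) -> float:
    """Ideal probability of measuring the SWAP-test ancilla in state 0."""
    return 0.5 + 0.5 * squared_overlap(normalize(psi), normalize(phi))


def fidelity_loss(psi: Sequence[complex], phi: Sequence[complex]) -> float:
    """Illustrative loss: 0 for identical states, 1 for orthogonal ones."""
    return 1.0 - squared_overlap(normalize(psi), normalize(phi))


p_orthogonal = swap_test_p0([1, 0], [0, 1])  # orthogonal states
p_identical = swap_test_p0([1, 1], [1, 1])   # identical states
```

Orthogonal states give an ancilla probability of 0.5 and identical states give a probability of 1 (up to floating-point rounding), so the statistic grows smoothly with the overlap. This smoothness is consistent with the abstract's description of the SWAP-test-based loss as differentiable.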

2605.31066 2026-06-01 cs.RO

Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

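Tianle Zeng, Yanci Wen, Xueang Yu, Hong Zhang

Affiliations: Southern University of Science and Technology; Fudan University

AI summary: Builds the CARLA-Air simulation environment to evaluate aerial vision-language-action models on air-ground cooperation tasks, finds that current models struggle to turn single-agent competence into stable cooperative behavior, and identifies three components needed for zero-shot cooperation: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment.

Comments: Code at https://github.com/louiszengCN/CarlaAir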

Abstract

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV-UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.

2605.31062 2026-06-01 cs.CL

AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering

Yuxin Wang, Jiahao Lu, Qifeng Wu, Shicheng Fang, Chuanyuan Tan, Yining Zheng, Xuanjing Huang, Xipeng Qiu

Affiliations: Computer Science, Fudan University; Institute of Modern Languages and Linguistics, Fudan University; Shanghai Innovation Institute; Soochow University

AI summary: Proposes the AdaptR1 framework, which uses reinforcement learning to dynamically allocate the reasoning budget at each step, reducing overthinking in multi-hop question answering and substantially lowering inference cost while maintaining performance.

Abstract

Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to "overthinking," where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.

2605.31061 2026-06-01 cs.LG cs.AI

STEP: Learning STructured Embeddings for Progressive Time Series

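Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

Affiliations: LIX, École Polytechnique; IRT SystemX; Safran Tech

AI summary: Proposes a self-supervised contrastive learning method that builds a low-dimensional manifold geometry anchored by fixed orthogonal prototype vectors, enabling end-state prediction, multi-step forecasting, and interpretable phase separation for progressive time series.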

Abstract

We present a novel method for learning interpretable representations of progressive time series, that is, data capturing irreversible state transitions such as degradation or task completion. Our approach uses a self-supervised contrastive objective to learn a low-dimensional latent space whose geometry is itself the interpretation: each observation becomes a point on a manifold anchored between two fixed orthogonal prototype vectors, and a trajectory becomes a path across that manifold. From this structure we read a latent compass, the polar coordinates (θ, r) of the latent vector, in which θ tracks the progression of the underlying state (e.g., from healthy to failed) and r identifies the active mode (e.g., the operating condition), without any proxy labels. We evaluate the approach against the state of the art on diverse domains, including industrial degradation, robotic tasks, and neural activity, validating three key capabilities: (1) end-state prediction, (2) multi-step forecasting, and (3) interpretable phase separation. Our method matches or improves over black-box counterparts on all of these while providing transparency about the underlying mechanisms. A simple linear regressor on top of the latent compass coordinates is competitive with deep architectures, direct quantitative evidence that the underlying state is encoded in a geometrically accessible form.
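
The latent compass readout can be illustrated with a minimal sketch. It assumes a two-dimensional latent vector and, purely for illustration, that the two orthogonal prototypes lie along the coordinate axes, so that theta runs from 0 at the first prototype to pi/2 at the second. The names `latent_compass` and `progression` are hypothetical, and the paper's actual latent dimension and prototype placement are not given in the abstract.

```python
import math
from typing import Tuple


def latent_compass(z: Tuple[float, float]) -> Tuple[float, float]:
    """Return the polar coordinates (theta, r) of a 2-D latent vector."""
    x, y = z
    r = math.hypot(x, y)      # radius: read as the active mode
    theta = math.atan2(y, x)  # angle: read as progression of the state
    return theta, r


def progression(theta: float) -> float:
    """Map theta to [0, 1], assuming prototypes at angles 0 and pi/2."""
    return min(max(theta / (math.pi / 2), 0.0), 1.0)


# A latent point halfway between the two assumed prototype directions.
theta, r = latent_compass((1.0, 1.0))
halfway = progression(theta)
```

Because theta and r are scalar readouts, a linear model on top of them stays easy to inspect, which matches the abstract's remark that a simple linear regressor on the compass coordinates is competitive with deep architectures.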

2605.31058 2026-06-01 cs.CL cs.SE

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Lero the Research Ireland Centre for Software, University of Limerick; The Chinese University of Hong Kong, Shenzhen; The Hong Kong University of Science and Technology

AI summary: Proposes the Atomic Decomposition and Recombination (ADR) framework, which decomposes code tasks into atomic elements and recombines them in a controlled way to generate novel, challenging, verifiable code tasks, addressing the data scarcity and scalability limits of RLVR training; experiments show substantial gains in code ability across multiple downstream domains.

Comments: Work in progress

AI中文摘要

基于可验证奖励的强化学习（RLVR）近期已成为塑造大型语言模型（LLMs）卓越编码能力的基石。然而，RLVR的可扩展性受到严重制约，因为缺乏足够具有挑战性的、针对模型能力边缘的可验证代码任务。先前的研究通常依赖启发式种子扩展进行数据合成，这严重限制了新颖性和难度。因此，此类数据的训练价值无法随合成规模成比例扩展。为此，我们提出原子分解与重组（ADR），一种通过将任务分解为原子元素并进行受控重组来生成可验证代码任务的新框架，从而能够生成真正新颖且具有挑战性的可验证代码任务。实验和分析表明，ADR在原创性、难度、多样性和测试质量方面优于现有基线，并在包括算法编程、工具使用和数据科学在内的多个下游领域的RLVR中持续带来更大的代码能力提升。我们的工作为新颖代码任务合成和可扩展的RLVR训练开辟了新范式。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that target near the model's edge of competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the size of its synthesis. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.

2605.31057 2026-06-01 cs.CV cs.LG

LVSA: Training-Free Sparse Attention for Long Video Diffusion

LVSA：长视频扩散的无训练稀疏注意力

Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu

发表机构 * Distributed Parallel Technology Laboratory, Paris Research Center, Huawei Technologies France（华为法国巴黎研究中心分布式并行技术实验室）； AI Framework and Data Technology Lab, Huawei Technologies Co., Ltd.（华为技术有限公司人工智能框架与数据技术实验室）

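AI summary: Proposes LVSA, a training-free, model-agnostic block-sparse attention method that combines a structured window pattern with rotating global anchors. It lowers the compute cost of long-video diffusion inference, removes fixed-grid bias, and supports generation beyond the training horizon.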

Comments 10 pages, 5 figures, 4 tables. Code: https://github.com/JiusiServe/LongVideoSparseAttention

AI中文摘要

密集自注意力是长视频扩散推理的计算和质量的瓶颈：成本随序列长度二次增长，且超出训练时域时模型收敛到近乎静态的输出，即“冻结”的重复视频。最先进的方法要么成本过高（例如需要重新训练），要么无法以可扩展的方式同时满足性能和质量目标。为此，我们提出长视频稀疏注意力（LVSA），一种无需训练、模型无关的块稀疏注意力方法，用于视频扩散Transformer，它结合了结构化窗口模式与旋转全局锚点，从而消除了导致长时域伪影的固定网格偏差。LVSA结合FlashInfer内核，与密集注意力相比，在Wan 2.1 1.3B上以6倍时域减少计算量达3.17倍，在Wan 2.1 14B上以6倍时域减少2.98倍，在HunyuanVideo 1.5上以1.5倍时域减少3.33倍。除了减少计算量，LVSA还使得HunyuanVideo 1.5能够在2倍时域下生成，否则在单个GPU上会内存不足。此外，与RIFLEx相比，LVSA在Wan 2.1 1.3B上提供高达2.41倍的加速，与UltraViCo相比提供3.27倍的加速。为了展示跨不同平台的适用性，我们将LVSA应用于NPU，与密集注意力相比，在Wan 2.2 A14B上实现高达2.71倍的加速，在Wan 2.1 1.3B上实现3.24倍的加速。为了公平地评估质量，我们引入了VQeval，一个正确评分循环视频失败的工具，而VBench-Long等最先进评估器则会奖励这类失败。LVSA在训练时域长度下生成时质量中性，在扩展长度下质量积极。

英文摘要

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out-of-memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduce VQeval, a tool properly scoring loopy video failures, which instead are rewarded in state of the art evaluators like VBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.
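
To make the idea of "a structured window pattern with rotating global anchors" more concrete, here is a rough sketch of a block-level attention mask. The abstract does not specify the actual pattern, so everything here is an assumption: the function names, the evenly spaced anchor layout, and the use of a `step` index (for example a layer or denoising step) to shift the anchors are all illustrative, not LVSA's implementation.

```python
def block_mask(num_blocks, window, num_anchors, step):
    """Build a num_blocks x num_blocks boolean mask (True = attend).

    Each query block attends to key blocks within `window` blocks of itself,
    plus a set of global anchor blocks. The anchors are evenly spaced and
    shifted by `step`, so no fixed grid of blocks is always favored
    (hypothetical rotation rule).
    """
    anchors = set()
    if num_anchors > 0:
        stride = max(1, num_blocks // num_anchors)
        offset = step % stride
        anchors = {(offset + k * stride) % num_blocks for k in range(num_anchors)}
    return [
        [abs(q - k) <= window or k in anchors for k in range(num_blocks)]
        for q in range(num_blocks)
    ]

def density(mask):
    """Fraction of block pairs that are computed (1.0 means dense attention)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

In a real system a mask like this would be handed to a block-sparse attention kernel; the sketch only shows how the kept fraction of block pairs falls below 1.0 while the anchor positions change from step to step.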

2605.31056 2026-06-01 cs.CL

How Much Do LLMs Know About Chinese Zero Pronouns?

LLMs 对中文零代词的了解程度如何？

Yifei Li, Guanyi Chen, Tingting He

发表机构 * Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning（湖北省人工智能与智慧学习重点实验室）； National Language Resources Monitoring and Research Center for Network Media（国家语言资源监测与研究网络媒体中心）； School of Computer Science, Central China Normal University（华中师范大学计算机学院）

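AI summary: Systematically evaluates how well large language models handle Chinese zero pronouns through a series of linguistically motivated tasks (identification, referentiality classification, reference type classification, resolution, and translation). Current LLMs still struggle with zero pronouns, especially on upstream tasks such as identification and referentiality classification.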

2605.31053 2026-06-01 cs.SD cs.AI

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

AnchorSteer: 自发现概念注入用于结构保持的音乐编辑

Chih-Heng Chang, Keng-Seng Ho, Chih-Yu Tsai, Kuan-Lin Chen, Yi-Hsuan Yang, Jian-Jiun Ding

发表机构 * National Taiwan University（国立台湾大学）

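AI summary: Proposes AnchorSteer, a framework that disentangles semantic-structural entanglement by combining structural anchoring with self-discovered semantic injection, enabling significant semantic transformations with high-fidelity structure preservation.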

Comments Accepted by the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

AI中文摘要

可控音乐编辑旨在修改高级属性，同时严格保留节奏和旋律结构。然而，这一任务面临语义-结构纠缠的挑战：引导方法往往为了编辑性能而牺牲结构，而结构适配器则抑制语义响应。我们提出AnchorSteer，一个通过将结构锚定与自发现语义引导耦合来解耦这种张力的框架。该方法通过自监督重构目标探测内部表示，提取可解释、无标签的概念向量，无需精心策划的数据即可隔离属性。在编辑过程中，这些便携、即插即用的概念向量被注入扩散隐空间，同时结构适配器强制执行一致性。提供了无条件和条件注入的变体，以平衡鲁棒性和语义强度。在ZoME-Bench和主观测试上的实验表明，所提出的框架优于纯引导和纯锚定的基线，实现了高保真结构保持下的显著语义变换。

英文摘要

Controllable music editing aims to modify high-level attributes while strictly preserving rhythmic and melodic structures. However, this task is challenged by a semantic-structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that disentangles this tension by coupling structural anchoring with self-discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label-free concept vectors via a self-supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug-and-play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injections are provided to balance robustness and semantic strength. Experiments on ZoME-Bench and subjective tests show that the proposed framework outperforms both steering-only and anchoring-only baselines, enabling significant semantic transformations with high-fidelity structural preservation.

2605.31050 2026-06-01 cs.LG

Best-Arm Identification-Based Trust Region Selection for Bayesian Optimization on Multimodal Functions

基于最佳臂识别的多模态函数贝叶斯优化信任区域选择

Nobuo Namura, Sho Takemori

发表机构 * Fujitsu Limited（富士通有限公司）

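AI summary: Proposes a trajectory-aware framework that combines best-arm identification with trust region-based Bayesian optimization. It predicts the final performance of local optimizers and progressively eliminates suboptimal candidates to speed up global optimization of multimodal functions.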

Comments 19 pages, 13 figures

AI中文摘要

基于高斯过程的贝叶斯优化是昂贵的黑箱优化的流行方法，但其性能在复杂多模态或高维问题上常常下降。基于信任区域的贝叶斯优化通过聚焦局部区域缓解了这一问题，最近的研究表明，选择有效区域可以建模为多臂老虎机问题。我们提出了一种轨迹感知框架，将最佳臂识别与基于信任区域的贝叶斯优化相结合，以高效求解多模态优化问题。我们的方法外推多个局部初始化优化器的优化轨迹以预测其最终性能，并通过最佳臂识别逐步淘汰次优候选。我们从理论上证明，在温和假设下，所提出的最佳臂识别引导的贝叶斯优化比传统贝叶斯优化更快收敛到全局最优，并通过在合成和真实世界基准上的大量实验证明了其有效性。

英文摘要

Gaussian process-based Bayesian optimization (BO) is a popular approach for expensive black-box optimization, but its performance often degrades on complex multimodal or high-dimensional problems. Trust region-based BO mitigates this issue by focusing on local regions, and recent studies suggest that selecting an effective region can be formulated as a multi-armed bandit problem. We propose a trajectory-aware framework that integrates best-arm identification (BAI) with trust region-based BO to efficiently solve multimodal optimization problems. Our method extrapolates the optimization trajectories of multiple locally initialized optimizers to predict their final performance and progressively eliminates suboptimal candidates via BAI. We theoretically show that the proposed BAI-guided BO converges faster to the global optimum than conventional BO under mild assumptions, and demonstrate its effectiveness through extensive experiments on synthetic and real-world benchmarks.
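
The loop of "extrapolate each local optimizer's trajectory, then eliminate weak candidates" can be sketched as follows. This is a toy stand-in, not the paper's algorithm: it swaps in plain successive halving for the best-arm identification rule, uses a naive linear extrapolation as the predictor, and replaces trust-region BO runs with a hypothetical `make_toy_optimizer` whose best-so-far value rises toward a fixed limit.

```python
def make_toy_optimizer(limit, rate):
    """Hypothetical local optimizer: best-so-far value approaches `limit`."""
    state = {"t": 0}
    def step():
        state["t"] += 1
        return limit * (1.0 - (1.0 - rate) ** state["t"])
    return step

def extrapolate(history, horizon):
    """Toy predictor: extend the last improvement linearly for `horizon` steps."""
    if len(history) < 2:
        return history[-1]
    slope = history[-1] - history[-2]
    return history[-1] + slope * horizon

def successive_halving(optimizers, rounds, steps_per_round, horizon):
    """Spend budget on surviving optimizers and drop the worse half each round.

    `optimizers` maps a name to a zero-argument callable that runs one step
    and returns the best objective value found so far (maximization).
    """
    histories = {name: [] for name in optimizers}
    alive = list(optimizers)
    for _round in range(rounds):
        for name in alive:
            for _step in range(steps_per_round):
                histories[name].append(optimizers[name]())
        if len(alive) == 1:
            break
        # Rank survivors by predicted final performance, keep the top half.
        alive.sort(key=lambda n: extrapolate(histories[n], horizon), reverse=True)
        alive = alive[: max(1, len(alive) // 2)]
    return alive[0], histories
```

The point of the sketch is the budget allocation: eliminated candidates stop consuming evaluations, so the most promising region receives most of the remaining budget.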

2605.31049 2026-06-01 cs.LG cs.AI cs.LO

Learning to Solve and Optimize by Evolving Code

通过代码演化学习求解与优化

Veronika Semmelrock, Benedetta Strizzolo, Francesco Zuccato, Gerhard Friedrich, Patrick Rodler, Konstantin Schekotihin

发表机构 * University of Klagenfurt（克拉根福大学）； University of Udine（乌迪内大学）

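AI summary: Introduces CHECKMATE, a tool that uses a formal specification to ensure solution correctness and a natural language description to guide code evolution. It generates algorithms automatically and outperforms state-of-the-art solvers on configuration and scheduling problems.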

Comments Preprint of a paper accepted to IJCAI26

AI中文摘要

组合与优化问题是许多工业AI应用的基础。解决此类大规模现实世界实例通常需要仔细的问题形式化、专门的求解器以及专家设计的启发式方法。因此，专家不仅需要指定解是什么，还需要指定如何推导出解。通过引入工具CHECKMATE，我们展示了通过代码演化生成算法代表了一种范式转变，消除了制定如何的需求。CHECKMATE仅依赖于是什么。具体来说，形式规范确保了解的正确性，并能够对生成的程序进行系统性能评估，而自然语言描述则指导演化过程。我们的方法在两个工业领域（配置与调度）的选定问题上展示了有效性。在所有案例中，演化出的算法始终优于最先进的求解器。这凸显了形式方法在引导代码演化以自动解决复杂现实问题方面的潜力。

英文摘要

Combinatorial and optimization problems are fundamental to many industrial AI applications. Solving large-scale real-world instances of such problems typically requires careful problem formalization, specialized solvers, and expert-designed heuristics. Thus, experts need to specify not only what solutions are, but also how they are derived. By introducing the tool CHECKMATE, we show that algorithm generation via code evolution represents a paradigm shift by eliminating the need to formulate the how. CHECKMATE solely relies on the what. Specifically, a formal specification ensures solutions' correctness and enables systematic performance evaluation of the generated programs, while a natural language description guides the evolutionary process. The effectiveness of our method is demonstrated on selected problems from two industrial domains: configuration and scheduling. In all cases, the evolved algorithms consistently outperform state-of-the-art solvers. This underscores the potential of formal methods in guiding code evolution for automatically solving complex real-world problems.

2605.31048 2026-06-01 cs.CV

Rethinking Efficient Crack Segmentation with Task-Aligned Structural-Directional Modeling

重新思考基于任务对齐的结构-方向性建模的高效裂缝分割

Shipeng Liu, Liang Zhao, Dengfeng Chen, Weihua Zhang

发表机构 * XAUAT（西安建筑科技大学）

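AI summary: Treats crack segmentation as a sparse structural recovery problem and proposes the RIFT model family, which preserves local evidence and aggregates directional continuity through lightweight multi-scale fusion, achieving the best or tied-best results on 16 metrics.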

AI中文摘要

最近的裂缝分割方法通常遵循通用的语义分割设计，使用更强的骨干网络、混合CNN-Transformer-Mamba编码器和辅助增强分支。虽然有效，但这引发了疑问：更强的通用特征混合是否是裂缝分割最合适的方向。相反，我们将裂缝分割表述为稀疏结构恢复。裂缝具有有限的类别级语义，但具有很强的形态规律性，即细、稀疏、各向异性、局部碎片化，且容易与纹理或阴影混淆。因此，关键瓶颈在于保留弱结构证据、恢复方向连续性以及抑制背景耦合。我们提出RIFT，一个紧凑的形态对齐裂缝分割模型家族。RIFT设计简单，而不是压缩复杂的通用架构，它保留局部证据，聚合协作方向连续性，并通过轻量多尺度融合恢复裂缝结构。在四个公共基准上的实验表明，RIFT在16个主要指标上对再现的代表性基线取得了最佳或并列最佳结果。RIFT-B提供了最强的整体精度，而RIFT-T提供了最佳的部署效率，仅0.47M参数和高推理速度。拓扑感知评估、消融实验、迁移实验和可视化进一步验证了，当其归纳偏置与裂缝形态匹配时，任务对齐的简单性可以匹配或超越复杂的混合架构。代码：https://github.com/xauat-liushipeng/RIFT

英文摘要

Recent crack segmentation methods often follow generic semantic segmentation designs, using stronger backbones, hybrid CNN-Transformer-Mamba encoders, and auxiliary enhancement branches. Although effective, this raises the question of whether stronger generic feature mixing is the most suitable direction for crack segmentation. We instead formulate crack segmentation as sparse structural recovery. Cracks have limited category-level semantics but strong morphological regularities, being thin, sparse, anisotropic, locally fragmented, and easily confused with textures or shadows. Thus, the key bottleneck lies in preserving weak structural evidence, recovering directional continuity, and suppressing background coupling. We propose RIFT, a compact family of morphology-aligned crack segmentation models. Rather than compressing a complex generic architecture, RIFT is simple by design, preserving local evidence, aggregating cooperative directional continuity, and restoring crack structures through lightweight multi-scale fusion. Experiments on four public benchmarks show that RIFT achieves the best or tied-best results across the 16 main metrics against reproduced representative baselines. RIFT-B gives the strongest overall accuracy, while RIFT-T provides the best deployment efficiency with only 0.47M parameters and high inference speed. Topology-aware evaluation, ablations, transfer experiments, and visualizations further verify that task-aligned simplicity can match or surpass complex hybrid architectures when its inductive bias fits crack morphology. Code: https://github.com/xauat-liushipeng/RIFT

2605.31044 2026-06-01 cs.LG

The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

使用强化学习控制工业能源系统的挑战

Tobias Lademann, Théo Vincent, Jan Peters, Matthias Weigold

发表机构 * Institute for Production Management, Technology and Machine Tools (PTW), Technical University of Darmstadt（达姆施塔特工业大学生产管理、技术与机床研究所（PTW））； DFKI GmbH, SAIROL（DFKI GmbH，SAIROL）； Department of Computer Science, Technical University of Darmstadt（达姆施塔特工业大学计算机科学系）； Hessian.ai, Technical University of Darmstadt（达姆施塔特工业大学 Hessian.ai）

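AI summary: Using a heating network as a case study, this paper examines the challenges of deploying reinforcement learning in real industrial energy systems, including partial observability, action space design, reward design, and the sim-to-real gap. Based on an actual deployment, it finds that reinforcement learning achieves operational stability but leaves a performance gap.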

Comments Submitted to Finding the Frame Workshop at RLC 2026

2605.31041 2026-06-01 cs.CV cs.AI

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

视觉信息在视觉-语言-动作模型驾驶行为中是否起决定性作用？

Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang, Xinhu Zheng

发表机构 * Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州）智能交通学域）

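AI summary: Proposes a structured multi-level visual perturbation framework to systematically analyze how strongly VLA driving models depend on visual information. It reveals that dependency patterns vary with the evaluation setting and that visual grounding is uneven across abstraction levels.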

AI中文摘要

视觉-语言-动作（VLA）模型在自动驾驶中展现出令人期待的能力，凸显了统一多模态架构联合建模感知与规划的潜力。然而，当前基于VLA的驾驶行为如何植根于视觉信息仍知之甚少。现有评估协议主要关注聚合性能指标，缺乏结构化和实用的诊断方法来量化视觉-行为依赖性。在这项工作中，我们引入了一个结构化的多级视觉扰动框架，以系统分析基于VLA的驾驶模型中的视觉-行为依赖性。该框架沿着三个互补维度组织受控视觉扰动：通道级退化、信息级破坏和结构级修改。我们将其应用于基于VLA的驾驶系统，并在开环轨迹预测和交互式闭环安全评估下评估行为响应。实验揭示了依赖于评估的依赖模式以及跨抽象层次的不均匀视觉基础。这些发现呼吁对VLA驾驶模型进行更结构化的分析和原则性设计，以更好地理解视觉信息如何塑造行为，并开发更安全、更鲁棒的系统。

英文摘要

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. In this work, we introduce a structured multi-level visual perturbation framework to systematically analyze visual-behavior dependency in VLA-based driving models. The framework organizes controlled visual perturbations along three complementary dimensions: channel-level degradation, information-level disruption, and structure-level modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.

2605.31040 2026-06-01 cs.LG

UniRTL: Unifying Code and Graph for Robust RTL Representation Learning

UniRTL：统一代码和图以实现稳健的RTL表示学习

Yi Liu, Hongji Zhang, Lei Chen, Mingxuan Yuan, Qiang Xu

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR（计算机科学与工程系，香港中文大学，香港特别行政区）； Noah's Ark Lab, Huawei, Hong Kong SAR（华为诺亚实验室，香港特别行政区）

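AI summary: Proposes UniRTL, a multimodal pretraining framework that jointly leverages RTL code and control data flow graphs through mutual masked modeling and a hierarchical training strategy. It achieves fine-grained alignment and outperforms prior methods on performance prediction and code retrieval.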

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

AI中文摘要

为寄存器传输级（RTL）设计开发有效的表示对于加速硬件设计工作流至关重要。然而，现有方法通常依赖于单一数据模态，即RTL代码或其相关的基于图的表示，限制了所学表示的表达能力和泛化能力。对于RTL，控制数据流图（CDFG）提供了保留完整信息的全面结构表示，而代码模态显式编码了语义和功能信息。我们认为，整合这些互补模态对于全面理解RTL设计至关重要。为此，我们提出UniRTL，一种多模态预训练框架，通过联合利用代码和CDFG学习统一的RTL表示。UniRTL通过互掩码建模实现代码和图之间的细粒度对齐，并采用分层训练策略，该策略结合了预训练的图感知分词器以及在图集成之前对文本（即功能摘要）和代码进行分阶段对齐。我们在两种下游任务（性能预测和代码检索）的多种设置下评估UniRTL。实验结果表明，UniRTL始终优于先前的方法，使其成为推进硬件设计自动化的更稳健和更强大的基础。

英文摘要

Developing effective representations for register transfer level (RTL) designs is crucial for accelerating the hardware design workflow. Existing approaches, however, typically rely on a single data modality, either the RTL code or its associated graph-based representation, limiting the expressiveness and generalization ability of the learned representations. For RTL, the control data flow graph (CDFG) offers a comprehensive structural representation that preserves complete information, while the code modality explicitly encodes semantic and functional information. We argue that integrating these complementary modalities is essential for a thorough understanding of RTL designs. To this end, we propose UniRTL, a multimodal pretraining framework that learns unified RTL representations by jointly leveraging code and CDFG. UniRTL achieves fine-grained alignment between code and graph through mutual masked modeling and employs a hierarchical training strategy that incorporates a pretrained graph-aware tokenizer and staged alignment of text (i.e., functional summary) and code prior to graph integration. We evaluate UniRTL on two downstream tasks, performance prediction and code retrieval, under multiple settings. Experimental results show that UniRTL consistently outperforms prior methods, establishing it as a more robust and powerful foundation for advancing hardware design automation.

2605.31034 2026-06-01 cs.LG cs.AI

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

多臂贝叶斯老虎机中的退火Softmax贪婪算法

William Overman, Mohsen Bayati

发表机构 * Stanford University（斯坦福大学）

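AI summary: Studies the Bayes regret of annealed softmax greedy in many-armed Bayesian Bernoulli bandits. It proves that when the prior satisfies a linear upper-tail condition (the β=1 case of β-regularity), the algorithm attains a near-optimal Bayes regret rate, and it draws a structural analogy to RLVR methods.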

AI中文摘要

具有可验证奖励的强化学习（RLVR）和基于组的策略优化方法（如GRPO）通过为每个提示采样多个完成并增加策略在奖励较高的完成上的概率来更新随机策略，同时通过KL惩罚向参考策略正则化。这些更新不包括追踪认知不确定性的显式机制。本文研究为何这种不确定性无关的更新仍然有效的一个风格化解释。我们分析了一个退火softmax（玻尔兹曼）策略，该策略在多臂贝叶斯伯努利老虎机中根据经验平均奖励的softmax选择动作。在先验满足线性上尾条件（β正则性的β=1情况）下，该条件意味着存在大量接近最优的臂，我们证明退火softmax贪婪算法实现了贝叶斯遗憾$\tilde{O}(m + T/m)$，特别地，当臂数$m = Θ(\sqrt{T})$时，遗憾为$\tilde{O}(\sqrt{T})$。这是该机制下接近最优的贝叶斯遗憾率，经验平均贪婪算法也能达到。在β正则性下，许多臂在整个学习过程中保持经验均值接近最优，因此当softmax采样一个非经验最优的臂时，该臂往往是另一个接近最优的臂，而不是明显较差的臂。相比之下，当臂数较少时，同类的softmax策略可能遭受线性遗憾。该结果也为RLVR提供了结构类比，其中以非可忽略概率产生正确完成的基础策略扮演了β正则性的角色。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit. Under a linear upper-tail condition on the prior (the $β=1$ case of $β$-regularity), which implies an abundance of near-optimal arms, we prove that annealed softmax greedy achieves Bayes regret $\tilde{O}(m + T/m)$, and in particular $\tilde{O}(\sqrt{T})$ when the number of arms scales as $m = Θ(\sqrt{T})$. This is the near-optimal Bayes regret rate in this regime, attained also by empirical-mean greedy. Under $β$-regularity, many arms maintain empirical means close to the optimum throughout learning, so when softmax samples an arm other than the empirically best, that arm tends to be another near-optimal one rather than a clearly inferior one. By contrast, with a small number of arms, the same kind of softmax policy can suffer linear regret. The result also provides a structural analogy to RLVR, where a base policy with a non-negligible probability of producing a correct completion plays the role of $β$-regularity.
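
A small simulation sketch of the policy analyzed above may help. Note first why $m = Θ(\sqrt{T})$ is the natural choice: the bound $m + T/m$ is minimized at $m = \sqrt{T}$, where it equals $2\sqrt{T}$. The code below is illustrative only. The temperature schedule `1 / log(t + 1)` and the initial round of one pull per arm are assumptions, since the abstract states neither, and the function names are hypothetical.

```python
import math
import random

def softmax_sample(means, temperature, rng):
    """Sample an arm index with probability proportional to exp(mean / temperature)."""
    top = max(means)  # subtract the max for numerical stability
    weights = [math.exp((m - top) / temperature) for m in means]
    u = rng.random() * sum(weights)
    acc = 0.0
    for arm, w in enumerate(weights):
        acc += w
        if u <= acc:
            return arm
    return len(weights) - 1

def annealed_softmax_greedy(probs, horizon, seed=0):
    """Run a Boltzmann policy over empirical means on a Bernoulli bandit.

    `probs` holds each arm's success probability. Returns (total reward, pulls).
    """
    rng = random.Random(seed)
    m = len(probs)
    pulls, wins, reward = [0] * m, [0] * m, 0
    for t in range(1, horizon + 1):
        if t <= m:
            arm = t - 1  # assumed initialization: pull every arm once
        else:
            means = [wins[a] / pulls[a] for a in range(m)]
            temperature = 1.0 / math.log(t + 1)  # assumed annealing schedule
            arm = softmax_sample(means, temperature, rng)
        r = 1 if rng.random() < probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += r
        reward += r
    return reward, pulls
```

The policy never computes confidence bounds or posteriors, which is the "uncertainty-agnostic" property the abstract highlights; exploration comes only from the softmax noise.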

2605.31033 2026-06-01 cs.CV

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

SlotMemory: 面向流式长视频生成的以对象为中心的KV记忆

Weijia Dou, Hui Li, Jiahao Cui, Lei Zhou, Jingdong Wang, Siyu Zhu

发表机构 * Fudan University（复旦大学）； Meta Superintelligence Labs（Meta超智能实验室）； Baidu（百度）

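AI summary: Proposes SlotMemory, an object-centric key-value memory mechanism that decomposes the transformer's key-value manifold into discrete semantic slots to provide entity-level persistence and prompt-aware retrieval, with a 22.8% relative improvement in dynamic consistency on 60-second interactive narratives.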

AI中文摘要

流式视频生成模型通常依赖于以时间为中心的记忆，将历史上下文组织为原始帧、片段或未聚类的令牌。这种组织方式常导致实体离开画面或交互式提示转换时出现身份漂移和语义不一致。为解决这些限制，我们提出SlotMemory，一种用于流式视频扩散的以对象为中心的键值记忆机制。我们的方法通过将变换器的键值流形分解为离散、可重用的语义槽，将记忆抽象从事件发生的“何时”转移到所表示的“什么”。通过利用这些槽作为路由地址来索引和存储高保真键值令牌，我们实现了跨长时域的实体级持久性和提示感知检索。在使用Wan2.1-T2V-1.3B骨干网络对60秒交互叙事进行评估时，SlotMemory达到了81.61的最先进质量分数，并在动态一致性上比现有最强流式基线相对提升22.8%。我们的结果表明，结构化的语义表示，而非原始时间容量，是持久长视频合成的关键原语。我们的代码和检查点可在https://tj12323.github.io/SlotMemory/获取。

英文摘要

Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic inconsistency when entities exit the frame or during interactive prompt transitions. To address these limitations, we propose SlotMemory, an object-centric Key-Value memory mechanism for streaming video diffusion. Our approach shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete, reusable semantic slots. By utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, we enable entity-level persistence and prompt-aware retrieval across long horizons. Evaluated on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone, SlotMemory achieves a state-of-the-art quality score of 81.61 and a 22.8 percent relative improvement in dynamic consistency over the strongest existing streaming baseline. Our results demonstrate that structured semantic representation, rather than raw temporal capacity, is the essential primitive for persistent long-form video synthesis. Our codes and checkpoints are available at https://tj12323.github.io/SlotMemory/.

2605.31031 2026-06-01 cs.AI

GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning

GraphARC：基于图的抽象推理综合基准

Saku Peltonen, August Bøgh Rønberg, Andreas Plesner, Roger Wattenhofer

发表机构 * ETH Zürich, Zürich, Switzerland； ETH Zürich

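AI summary: Introduces the GraphARC benchmark, which extends abstract reasoning to graph-structured data. It uses few-shot transformation learning tasks to evaluate generalization on local, global, and hierarchical graph transformations, and it reveals a comprehension-execution gap and scaling barriers in language models.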

Comments Accepted at KDD 2026 Datasets and Benchmarks Track

DOI: 10.1145/3770855.3817591

AI中文摘要

关系推理是智能的核心，但现有基准通常局限于网格或文本格式。我们引入了GraphARC，一个用于图结构数据抽象推理的基准。GraphARC推广了抽象与推理语料库（ARC）的少样本变换学习范式。每个任务需要从几个输入-输出对中推断变换规则，并将其应用于新的测试图，涵盖局部、全局和层次图变换。与基于网格的ARC不同，GraphARC实例可以在不同的图族和规模上大规模生成，从而能够系统评估泛化能力。我们在GraphARC上评估了最先进的语言模型，并观察到明显的局限性。模型能够回答关于图属性的问题，但往往无法解决完整的图变换任务，揭示了理解-执行差距。在更大实例上性能进一步下降，暴露了规模扩展障碍。更广泛地说，通过将节点分类、链接预测和图生成的方面结合在一个单一框架内，GraphARC为未来的图基础模型提供了一个有前景的测试平台。

英文摘要

Relational reasoning lies at the heart of intelligence, but existing benchmarks are typically confined to formats such as grids or text. We introduce GraphARC, a benchmark for abstract reasoning on graph-structured data. GraphARC generalizes the few-shot transformation learning paradigm of the Abstraction and Reasoning Corpus (ARC). Each task requires inferring a transformation rule from a few input-output pairs and applying it to a new test graph, covering local, global, and hierarchical graph transformations. Unlike grid-based ARC, GraphARC instances can be generated at scale across diverse graph families and sizes, enabling systematic evaluation of generalization abilities. We evaluate state-of-the-art language models on GraphARC and observe clear limitations. Models can answer questions about graph properties but often fail to solve the full graph transformation task, revealing a comprehension-execution gap. Performance further degrades on larger instances, exposing scaling barriers. More broadly, by combining aspects of node classification, link prediction, and graph generation within a single framework, GraphARC provides a promising testbed for future graph foundation models.

2605.31029 2026-06-01 cs.CV

PEEK: Picking Essential frames via Efficient Knowledge distillation

PEEK: 通过高效知识蒸馏提取关键帧

Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen

发表机构 * Télécom SudParis, SAMOVAR（Telecom SudParis, SAMOVAR）； Institut Polytechnique de Paris（巴黎理工学院）； Moments Lab

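AI summary: Proposes PEEK, which uses knowledge distillation to transfer a teacher model's frame relevance rankings to a lightweight temporal model for efficient dynamic frame sampling, significantly improving video captioning at low frame budgets.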

Comments Supplementary material at https://www.killian-steunou.com/peek/static/pdfs/peek_supplementary.pdf

AI中文摘要

视频语言模型只能处理有限数量的帧，使得帧选择成为高效视频字幕生成的关键瓶颈。大多数字幕生成流程仍依赖均匀采样，该方法计算成本低但忽略视觉内容。自适应帧采样最近成为从视频中选择最具信息量帧的有前景方法，但现有方法计算成本仍然高昂。我们提出PEEK，一种高效的动态帧采样方法，它将字幕条件帧相关性排名从更强的教师模型蒸馏到仅基于视觉内容运行的轻量级时序模型中。我们发现，总体而言，在ActivityNet Captions和MSR-VTT上，我们的方法在所有评估的下游视觉语言模型中优于最先进方法，特别是当仅选择一或两帧进行字幕生成时，在大多数帧预算下获得最佳CIDEr分数。在ActivityNet Captions上，PEEK尤其强大，在16个配置中赢得14个。在MSR-VTT上的零样本评估表明，我们的模型在低帧预算下迁移效果最佳，而在四帧和八帧时结果更为混合，因为时间覆盖和视觉多样性变得更具竞争力。与最近的自适应基线相比，PEEK在低预算场景下更准确且更高效：它仅增加5.2%的字幕生成时间，而CSTA增加65.4%，MaxInfo增加211.9%。我们在https://github.com/momentslab/peek发布代码和预训练检查点。

英文摘要

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only $5.2\%$ to the captioning time, compared with $65.4\%$ for CSTA and $211.9\%$ for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
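
The contrast between uniform sampling and score-based selection under a frame budget can be shown in a few lines. This is only a sketch: `scores` stands in for the per-frame relevance that a model such as the distilled temporal network would output, and both function names are hypothetical rather than part of the PEEK codebase.

```python
def uniform_indices(num_frames, budget):
    """Content-agnostic baseline: the center frame of each of `budget` equal segments."""
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]

def top_k_indices(scores, budget):
    """Content-aware selection: the `budget` highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

With a budget of one or two frames, uniform sampling depends entirely on where the segment centers happen to fall, which is one plausible reading of why the abstract reports the largest gains in that regime.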

2605.31025 2026-06-01 cs.CL

TRACE: Discovering Task-Specific Parameter via Adaptation-Aware Probing for Continual Fine-Tuning

TRACE: 通过适应感知探测发现任务特定参数以实现持续微调

Xiaosong Han, Ke Chen, Xindi Dai, Di Liang, Minlong Peng, Wei Pang, Fausto Giunchiglia, Xiaoyue Feng, Yonghao Liu, Renchu Guan

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； College of Software, Jilin University（吉林大学软件学院）； Fudan University（复旦大学）； School of Mathematical and Computer Sciences, Heriot-Watt University（赫瑞-瓦特大学数学与计算机科学学院）； Department of Information Engineering and Computer Science, University of Trento（特伦托大学信息工程与计算机科学系）

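AI summary: Proposes TRACE, which uses adaptation-aware probing to discover task-specific core parameters and updates only those parameters during continual fine-tuning to mitigate catastrophic forgetting. It also validates transferability across models and scales.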

Comments KDD2026

DOI: 10.1145/3770855.3817801

AI中文摘要

在实际部署中，大型语言模型通常需要跨任务持续适应以保持最新状态，新的微调应保留先前学到的技能。然而，不加区分地混合任务会稀释任务特化，而顺序微调（全参数或低秩适应）常因破坏性覆盖导致灾难性遗忘。基于回放的持续微调和维护单独的任务特定适配器可以缓解遗忘，但引入了额外的计算、存储和管理开销。认识到LLM参数对于任何单一任务都存在冗余，我们将持续任务适应重新定义为通过适应感知探测发现任务特定参数：短时预热探测暴露任务的适应轨迹，使我们能够识别并隔离每个任务所需的一小部分关键参数，以缓解灾难性遗忘。基于这一观点，我们引入了TRACE，一种通过适应感知探测发现任务特定参数以实现持续微调的新方法。我们进行短时预热微调，通过比较预热模型和预训练模型来推导任务特定核心参数。核心参数通过两种策略识别：重要性评分（L2范数和Fisher信息）和特异性分析（参数更新的余弦相似度）。在持续微调设置中，仅更新当前任务的核心参数，其余参数保持冻结，从而保留先前知识。我们在多个标准基准上进行了广泛实验，证明了所提方法的优越性能。此外，我们通过跨模型和规模迁移性研究验证了方法的泛化能力，展示了在资源约束下指导大规模模型微调的“小到大”范式。

英文摘要

In real-world deployment, LLMs are often adapted continually across tasks to stay up-to-date in production, and each new round of fine-tuning should preserve previously learned skills. However, indiscriminately mixing tasks can dilute task specialization, while sequential fine-tuning (full-parameter or low-rank adaptation) often causes catastrophic forgetting due to destructive overwriting. Replay-based continual tuning and maintaining separate task-specific adapters can mitigate forgetting, but introduce additional compute, storage, and management overhead. Recognizing the redundancy of LLM parameters for any single task, we reframe continual task adaptation as task-specific parameter discovery via adaptation-aware probing: a short warm-start probe exposes a task's adaptation trace, enabling us to identify and isolate the small subset of parameters essential for each task to mitigate catastrophic forgetting. Building on this view, we introduce TRACE, a novel approach for discovering Task-specific paRameters via Adaptation-aware probing for Continual finE-tuning. We perform a short warm-start fine-tune to derive task-specific core parameters by comparing the warm-started and pre-trained models. Core parameters are identified via two strategies: importance scoring (L$_2$ norm and Fisher Information) and specificity analysis (cosine similarity of parameter updates). In continual fine-tuning settings, only the active task's core parameters are updated while others remain frozen, preserving prior knowledge. We conduct extensive experiments across multiple standard benchmarks to demonstrate the superior performance of our proposed method. Additionally, we validate the generalization of our method through a cross-model and scale transferability study, demonstrating a "small-to-large" paradigm that guides the fine-tuning of large-scale models under resource constraints.
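
The two scoring ideas named in the abstract, update magnitude and update similarity, can be sketched on plain Python lists. This is an illustration under stated assumptions, not TRACE itself: parameters are grouped by name (the paper's selection granularity is not given here), Fisher information is omitted, and the function names and `keep_fraction` parameter are hypothetical.

```python
import math

def update_norms(pretrained, warm):
    """L2 norm of (warm - pretrained) for each named parameter group."""
    return {
        name: math.sqrt(sum((w - p) ** 2 for w, p in zip(warm[name], pretrained[name])))
        for name in pretrained
    }

def core_groups(norms, keep_fraction):
    """Keep the groups with the largest update norms; the rest would stay frozen."""
    k = max(1, round(keep_fraction * len(norms)))
    ranked = sorted(norms, key=norms.get, reverse=True)
    return set(ranked[:k])

def cosine(u, v):
    """Cosine similarity of two update vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

In this reading, a low cosine similarity between two tasks' updates to the same group would suggest the group is used differently by each task, which is the kind of specificity signal the abstract describes.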

2605.31023 2026-06-01 cs.AI cs.LG cs.MA

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

HADT: 一种用于自主对地观测卫星集群的异构多智能体差分Transformer

Mohamad A. Hady, Muhammad Anwar Masum, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

发表机构 * School of Computer Science and Information Technology, Adelaide University（阿德莱德大学计算机科学与信息技术学院）； School of Electrical Engineering, Computing and Mathematical Sciences (EECMS), Curtin University（科廷大学电气工程、计算与数学科学学院（EECMS））； Systems Research Institute, Polish Academy of Sciences（波兰科学院系统研究所）

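AI summary: For autonomous Earth observation with heterogeneous satellite clusters, proposes a transformer-based architecture that uses relational observation-action tokenization and a differential attention mechanism for adaptive real-time resource management, significantly outperforming the baselines.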

Comments Accepted in ECML-PKDD 2026. arXiv admin note: text overlap with arXiv:2511.12792

AI中文摘要

本文解决了执行对地观测任务（包括光学和合成孔径雷达卫星）的异构卫星集群中的自主资源管理问题。在自主运行模式下，卫星配备智能能力，能够根据最新条件实时决策，同时最小化与地面操作员的交互。传统的调度方法通常依赖数学模型来表示卫星任务和资源管理，然后通过优化算法求解。然而，当底层模型不可用、过于复杂或因空间任务环境中的动态变化和不确定性而不准确时，此类解决方案效果不佳。一个有前景的替代方案是将问题重新表述为序列决策过程，并应用无模型强化学习技术来实现自适应和实时资源管理。为此，我们提出了一种新颖的基于Transformer的架构，专门针对异构卫星集群自主对地观测任务，采用关系观测-动作令牌化和差分注意力机制。我们的实验结果表明，与现有基线相比，性能有显著提升。此外，所提出的架构在不同卫星集群数量下表现出强大的适应性和可迁移性。

英文摘要

This work addresses the problem of autonomous resource management in heterogeneous satellite clusters conducting Earth Observation (EO) missions, including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite missions and resource management, and then solve the resulting problem with optimization algorithms. However, such solutions become less effective when the underlying models are unavailable, overly complex, or inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive and real-time resource management. To this end, we propose a novel transformer-based architecture tailored for autonomous EO missions with heterogeneous satellite clusters, using relational observation-action tokenization and a differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.

2605.31022 2026-06-01 cs.LG

Augmented Lagrangian Predictive Coding

Jeffrey Seely, Julian Gould

Affiliation: Sakana AI

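AI summary: Proposes Augmented Lagrangian Predictive Coding (PC-ALM), which accumulates constraint errors into layer-local Lagrange multipliers so that local updates align with backpropagation gradients, matching backpropagation performance in deep networks.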

Comments 22 pages, 10 figures

English abstract

Predictive coding (PC) is a local-learning alternative to backpropagation (BP), training deep networks via local energy-minimization dynamics rather than a global backward pass. We introduce Augmented Lagrangian Predictive Coding (PC-ALM), which maintains PC's inference budget but aligns each weight update toward BP by accumulating per-layer constraint errors into a layer-local Lagrange multiplier. In linear PC networks, PC-ALM converges to an equilibrium with exact BP gradients distributed across the network via only layer-local updates. We analyze PC-ALM in nonlinear PC networks up to depth 128 and show that it matches BP performance across all width-depth regimes, notably in deep narrow networks where PC underperforms. PC-ALM introduces recurrent dynamics in each layer's activations. Compared to PC's heat flow on a scalar energy, PC-ALM dynamics are driven by dual ascent on the augmented Lagrangian. We observe "ballistic" credit propagation across very deep networks, with credit signals evenly distributed across layers, compared to PC's slow, diffusive credit propagation. Beyond the algorithm itself, the augmented Lagrangian framework offers a generalization of PC, and may yield insights into how distributed systems could compute and propagate BP-like credit signals through purely local dynamics.
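
The abstract refers to dual ascent on an augmented Lagrangian, with constraint errors accumulated into a Lagrange multiplier. The sketch below shows that generic idea (the classical method of multipliers) on a toy two-variable problem. It is not PC-ALM and involves no network layers; the objective, the constraint, and all parameter values are assumptions chosen only to make the multiplier update visible.

```python
def augmented_lagrangian_demo(rho=1.0, outer_steps=40, inner_steps=50, lr=0.2):
    """Method of multipliers on a toy problem (illustrative, not PC-ALM).

    minimize (x0 - 1)^2 + (x1 - 2)^2   subject to   x0 + x1 - 1 = 0

    Augmented Lagrangian:
        L(x, lam) = f(x) + lam * c(x) + (rho / 2) * c(x)^2
    """
    target = (1.0, 2.0)
    x = [0.0, 0.0]
    lam = 0.0
    for _ in range(outer_steps):
        # Primal step: approximately minimize L over x by gradient descent.
        for _ in range(inner_steps):
            c = x[0] + x[1] - 1.0
            shared = lam + rho * c  # d/dx of the multiplier and penalty terms
            x = [xi - lr * (2.0 * (xi - ti) + shared) for xi, ti in zip(x, target)]
        # Dual ascent: accumulate the remaining constraint error into lam.
        c = x[0] + x[1] - 1.0
        lam += rho * c
    return x, lam
```

For this toy problem the constrained minimizer can be worked out by hand as x = (0, 1) with multiplier 2, so calling `augmented_lagrangian_demo()` should return values close to those. The multiplier grows only while the constraint is violated, which is the accumulation behaviour the abstract describes at the level of each layer.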
