arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.31108 2026-06-01 cs.CV cs.LG

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

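Jonathan Swinnen, Tinne Tuytelaars

AI summary: Proposes a domain incremental learning method that pairs a main task head with a self-supervised masked autoencoder head and uses test-time training to identify the best-matching LoRA adapter so the model can re-remember a domain; suited to video-stream data.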

English abstract

In this work we introduce a novel approach to domain incremental learning, adapting models over time to evolving, non-stationary data. In contrast to other works, we do not attempt to avoid catastrophic forgetting, but rather allow it and exploit it. Our model combines a main task head with a self-supervised masked autoencoder (MAE) head. We then learn domain-specific LoRA adapters during incremental training. Each adapter specializes to its domain, naturally inducing forgetting on other domains in both heads. At inference, we perform online test-time training on the self-supervised MAE head to identify which LoRA best matches the current input, so the model can 'remember' the domain again. Our scheme is especially well-suited to real-world streaming data, such as video, where consecutive samples are highly correlated and domain shifts are gradual. We demonstrate our method on domain-incremental action recognition and semantic segmentation tasks.
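
The adapter-identification step can be illustrated with a minimal sketch. It assumes that each adapter yields a per-frame MAE reconstruction loss and that a smoothed version of that loss serves as the selection signal; the function name `select_adapter`, the momentum value, and the toy losses are hypothetical, and the test-time gradient updates described in the abstract are omitted.

```python
import numpy as np

def select_adapter(recon_losses, ema, momentum=0.9):
    """Update a running average of each adapter's reconstruction loss
    and return the index of the adapter with the lowest average."""
    losses = np.asarray(recon_losses, dtype=float)
    ema = momentum * ema + (1.0 - momentum) * losses
    return int(np.argmin(ema)), ema

# Toy stream (made-up numbers): adapter 1 reconstructs the first frames best,
# then the domain drifts and adapter 0 becomes the better match.
stream = [[0.9, 0.2, 0.8]] * 5 + [[0.1, 0.7, 0.8]] * 20
ema = np.full(3, 0.5)  # neutral starting estimate for three adapters
choices = []
for losses in stream:
    best, ema = select_adapter(losses, ema)
    choices.append(best)
```

Smoothing over consecutive frames reflects the abstract's observation that video samples are highly correlated and domain shifts are gradual, so the selection does not flip on a single noisy frame.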

2605.31106 2026-06-01 cs.LG

Riemannian Diffusion Models on General Manifolds via Physics-Informed Neural Networks

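Gyeonghoon Ko, Juho Lee

AI summary: To address the difficulty of computing heat kernels analytically on Riemannian manifolds, proposes approximating the heat kernel by solving the manifold heat equation with a physics-informed neural network, enabling diffusion-model training and sampling.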

English abstract

Riemannian diffusion models generalize score-based generative modeling to manifold-supported data via stochastic diffusion equations on the manifold. However, training requires sampling from and differentiating the manifold heat kernel, which is rarely available in closed form beyond a few highly symmetric manifolds. We propose a general approach that approximates the heat kernel by directly solving the manifold heat equation with a physics-informed neural network (PINN). Given an explicit manifold specification, we choose a coordinate system, derive the corresponding heat (Fokker--Planck) equation and a short-time asymptotic approximation, and then train a PINN to learn the log heat kernel. The resulting surrogate enables both forward noising (heat-kernel sampling) and conditional-score evaluation for denoising score matching. We demonstrate the method on diverse manifolds including $S^2$, $SO(3)$, $\mathrm{SPD}(n)$, and permutation-quotiented point clouds.
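
For orientation, the objects named in the abstract can be written out under one common convention. The block below assumes Brownian motion with generator $\tfrac{1}{2}\Delta_g$ on an $n$-dimensional manifold with metric $g$ and geodesic distance $d_g$; the paper may use a different normalization, and the last line is only the standard leading-order short-time behavior (exact in flat $\mathbb{R}^n$), not necessarily the asymptotic approximation the authors derive.

```latex
% Assumed convention: Brownian motion with generator (1/2) \Delta_g.
\begin{align}
\partial_t p_t(x \mid x_0) &= \tfrac{1}{2}\,\Delta_g\, p_t(x \mid x_0),
\qquad
\Delta_g f = \frac{1}{\sqrt{|g|}}\,\partial_i\!\left(\sqrt{|g|}\, g^{ij}\, \partial_j f\right), \\
u_t = \log p_t \;&\Longrightarrow\;
\partial_t u_t = \tfrac{1}{2}\left(\Delta_g u_t + \lVert \nabla_g u_t \rVert_g^2\right), \\
u_t(x \mid x_0) &\approx -\frac{d_g(x, x_0)^2}{2t} - \frac{n}{2}\log(2\pi t)
\quad \text{as } t \to 0^{+}.
\end{align}
```

The second line is the equation a network for the log heat kernel $u_t$ would have to satisfy, and the conditional score used in denoising score matching is the gradient $\nabla_g u_t$.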

2605.31105 2026-06-01 cs.CL

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

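Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai

AI summary: Proposes GRKV, which uses ridge-regression merge steps to minimize the discrepancy between compressed-cache and full-cache attention outputs, addressing the over-merging and information loss caused by the imbalanced merge pattern of span-based retention.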

Comments 21 pages, 7 figures

English abstract

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.
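
The idea of a regularized regression merge can be sketched for a single attention head. This is not GRKV's actual algorithm: it assumes a set of probe queries `Q`, a retention decision `keep` made elsewhere, and an update applied to the retained values only, all of which are illustrative choices. The update solves min over V' of ||A V' - O_full||² + λ||V' - V_keep||² in closed form, where A holds the attention weights over the retained keys.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def ridge_merge_values(Q, K, V, keep, lam=1e-2):
    """Adjust retained values so compressed attention output tracks the
    full-cache output; lam penalizes moving away from the original values."""
    d = Q.shape[1]
    target = softmax(Q @ K.T / np.sqrt(d)) @ V        # full-cache attention output
    A = softmax(Q @ K[keep].T / np.sqrt(d))           # weights over retained keys
    V_keep = V[keep]
    residual = target - A @ V_keep
    gram = A.T @ A + lam * np.eye(len(keep))
    delta = np.linalg.solve(gram, A.T @ residual)     # ridge solution for the update
    return V_keep + delta, target, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 8))
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 8))
keep = list(range(0, 32, 4))  # hypothetical retention: every fourth token
V_new, target, A = ridge_merge_values(Q, K, V, keep)
err_before = np.linalg.norm(A @ V[keep] - target)
err_after = np.linalg.norm(A @ V_new - target)
```

Because the update spreads the residual across all retained tokens rather than a few boundary tokens, it gives a concrete picture of "distributing information from evicted tokens", while the λ term plays the role of the regularizer against over-smoothing.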

2605.31100 2026-06-01 cs.AI cs.DB cs.IR

Vector Linking via Cross-Model Local Isometric Consistency

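Ziying Chen, Yang Cao, He Sun, Beining Yang, Tianjian Yang

AI summary: Proposes an iterative, reference-based geometric embedding hashing method built on local geometric consistency that recovers cross-model vector correspondences from a small set of seed anchors, achieving accurate and robust vector linking.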

Comments Accepted at ICML 2026

English abstract

We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not, owing to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing method that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. Code is available at https://github.com/DBgroup-Edinburgh/VecLinking.
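
A toy version of the anchor-distance representation and the Beta-Bernoulli aggregation might look as follows. It is a sketch under strong assumptions: the second embedding space is an exact scaled rotation of the first, brute-force nearest-signature matching stands in for the hash-space matching, and the bootstrapping of new anchors is left out. The function names and parameters are hypothetical and unrelated to the released VecLinking code.

```python
import numpy as np

def anchor_signature(X, anchors):
    """Distances from each vector to the anchors, normalized to unit length
    so that a global scale factor between models cancels out."""
    d = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=-1)
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def link_posterior(XA, XB, seedA, seedB, views=20, m=4, seed=0):
    """Vote for candidate links over several anchor subsets ("views") and
    return the Beta(1, 1)-prior posterior mean of each link being correct."""
    rng = np.random.default_rng(seed)
    n = len(XA)
    votes = np.zeros((n, len(XB)))
    for _ in range(views):
        idx = rng.choice(len(seedA), size=m, replace=False)
        sa = anchor_signature(XA, XA[seedA[idx]])
        sb = anchor_signature(XB, XB[seedB[idx]])
        cost = np.linalg.norm(sa[:, None, :] - sb[None, :, :], axis=-1)
        votes[np.arange(n), cost.argmin(axis=1)] += 1
    alpha = 1.0 + votes            # successes: views that proposed the link
    beta = 1.0 + views - votes     # failures: views that did not
    return alpha / (alpha + beta)

rng = np.random.default_rng(1)
XA = rng.normal(size=(30, 5))
rot, _ = np.linalg.qr(rng.normal(size=(5, 5)))
perm = rng.permutation(30)
XB = (2.5 * XA @ rot)[perm]        # same objects, rotated, rescaled, shuffled
true_link = np.argsort(perm)       # index in XB of each row of XA
seedA = np.arange(8)
seedB = true_link[:8]
post = link_posterior(XA, XB, seedA, seedB)
```

In this idealized setting every view agrees, so the posterior concentrates on the true links; with real encoders only short-range distances are preserved, which is why the abstract aggregates evidence across views before promoting links to anchors.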

2605.31099 2026-06-01 cs.CL cs.AI

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning

Dominik Soós, Meng Jiang, Jian Wu

AI summary: Proposes the KnowledgeGain metric, which evaluates science news quality by measuring readers' knowledge gain, and uses an LLM simulator to optimize generation and improve reader learning.

English abstract

Science news is an important medium to communicate discoveries between the research communities and the public. Yet, most metrics for generated or summarized text evaluate semantic similarity and factual consistency, but do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gained after reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator. We use it to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.

2605.31096 2026-06-01 cs.CV

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

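Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

AI summary: Proposes the iVGR framework, which uses reinforcement learning and a dual-stream training strategy to internalize visual localization into textual reasoning, avoiding the interference of explicit visual grounding at inference and improving fine-grained perception.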

Comments Accepted by ICML 2026

English abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

2605.31094 2026-06-01 cs.CV cs.AI

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster, Mehdi Astaraki, Paula Tamara Buzduga, Kerstin Ritter, Benedikt Wiestler, Jan Kirschke, Jonathan Shapey, Tom Vercauteren, Florian Kofler

AI summary: Recasts segment matching in panoptic segmentation as a constrained bipartite assignment problem, defines four matching strategies, extends them to part-aware evaluation, and releases a unified open-source package built on Panoptica.

Comments 9 pages, 4 figures

English abstract

The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.
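
As background for the matching problem, the standard PQ computation in the unambiguous regime (IoU above 0.5, where each segment can match at most one counterpart) can be sketched as below. Segments are modeled as non-empty sets of pixel ids for brevity; the function names are hypothetical and this is not the Panoptica API. The sketch deliberately refuses thresholds below 0.5, which is exactly where the abstract says several matching strategies become possible.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def panoptic_quality(gt_segments, pred_segments, threshold=0.5):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN), One-to-One matching."""
    if threshold < 0.5:
        raise ValueError("below 0.5 the One-to-One match is no longer unique")
    matched_pred = set()
    matched_ious = []
    for gt in gt_segments:
        for j, pred in enumerate(pred_segments):
            if j in matched_pred:
                continue
            value = iou(gt, pred)
            if value > threshold:
                matched_ious.append(value)
                matched_pred.add(j)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp   # predictions without a ground-truth match
    fn = len(gt_segments) - tp     # ground-truth segments that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0

gt = [{1, 2, 3, 4}, {10, 11}]
pred = [{2, 3, 4, 5}, {20, 21}]
pq = panoptic_quality(gt, pred)   # one match with IoU 3/5, one FP, one FN
```

Here TP, FP, and FN are counted per segment, which lines up with the vertex-based accounting the abstract describes, although the paper's Many-to-One and One-to-Many variants need a more general assignment step than this loop.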

2605.31093 2026-06-01 cs.CV

Cross-Modal Clinical Knowledge Integration for Mammography Report Generation

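Jiayi Zhu, Fuxiang Huang, Yu Xie, Xi Wang, Zhixuan Chen, Yuan Guo, Qingcong Kong, Zhenhui Li, Qiong Luo, Hao Chen

AI summary: Proposes the MammoRG framework, which simulates the clinical reporting workflow through two-stage training and integrates the BI-RADS guideline and prior knowledge to improve the clinical consistency of generated reports.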

Comments 16 pages, 5 figures

English abstract

Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.

2605.31090 2026-06-01 cs.CV cs.AI

On Revisiting Entropy for Identifying Mislabeled Images

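Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

AI summary: Proposes a signed entropy integral (SEI) statistic based on training dynamics that captures the magnitude and temporal trend of prediction entropy to identify mislabeled training samples, achieving state-of-the-art performance on medical imaging datasets.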

Comments ICML 2026

English abstract

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.
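
The key observation about entropy trajectories can be made concrete with a small sketch. The abstract does not give the SEI formula, so the statistic below is a hypothetical stand-in: the signed area between a sample's entropy curve and its starting entropy, which is strongly negative when entropy falls steadily and close to zero when it stays high. The function names and the probability histories are invented for illustration.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each predicted class distribution."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def signed_entropy_area(prob_history):
    """Hypothetical statistic, not the paper's definition: sum over epochs of
    the entropy change relative to the first recorded epoch."""
    h = prediction_entropy(prob_history)   # one entropy value per epoch
    return float(np.sum(h[1:] - h[0]))

# Made-up per-epoch softmax outputs for two training samples.
clean_sample = [[0.34, 0.33, 0.33], [0.60, 0.20, 0.20],
                [0.90, 0.05, 0.05], [0.98, 0.01, 0.01]]
noisy_sample = [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30],
                [0.38, 0.32, 0.30], [0.40, 0.35, 0.25]]
clean_score = signed_entropy_area(clean_sample)
noisy_score = signed_entropy_area(noisy_sample)
```

Ranking samples by such a score and flagging the highest values would be one way to surface label-error candidates; the only extra cost is storing each sample's prediction per epoch.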

2605.31082 2026-06-01 cs.SD cs.MM

Sound effects in media: A comparative analysis of recorded and synthetic samples in live-action and animation

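Nelly Garcia, Joshua Reiss

AI summary: Compares the believability of procedurally generated synthetic sound effects with real recordings in live-action and animated scenes; synthetic effects perform well in drama and sci-fi scenes but are less believable for everyday actions in cartoons.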

Comments ArtsIT, Interactivity and Game Creation 2024

English abstract

Creating sound for storytelling is crucial to establishing the environment in productions such as films, TV series and video games. This process often involves repeating, layering and recording real objects or using sound libraries, which can be time-consuming and repetitive. To address these challenges, procedural audio, also known as digital foley, offers a solution by allowing sound designers to quickly generate samples. Despite its efficiency, questions remain about the believability of synthetic samples compared to real ones. In our study, we compared synthetic samples generated by an online procedural engine and integrated them with both animated and live-action visuals. Our results indicate that procedural audio is highly effective and perceived as believable in drama and sci-fi scenes, particularly for sound models such as lasers, hits, air and rockets, whereas synthetic sounds were less believable in cartoon productions when representing everyday actions. Finally, we identified specific models that needed optimisation and highlighted audio features that needed improvement with feedback from audio professionals.

2605.31075 2026-06-01 cs.CV

Task-Focused Memorization for Multimodal Agents

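Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li

AI summary: Proposes TaskMem, a reinforcement-learning-based framework for learning task-focused memorization policies; two-phase training lets multimodal agents dynamically select task-relevant memories from streaming observations, raising VQA accuracy by 5.3% to 7.0% on three streaming benchmarks.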

English abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

2605.31073 2026-06-01 cs.CL

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

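Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue

AI summary: Proposes the ConsisGuard framework, which uses Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment to resolve the inconsistency between deliberation and enforcement in reasoning-based LLM guardrails, improving safety detection and reducing policy execution failures.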

Comments 18 pages, 9 figures

English abstract

Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.

2605.31070 2026-06-01 cs.LG cs.GT

Learning to Bid in FCR Markets: A Best-of-Both-Worlds Approach

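Marius Potfer, Cheng Wan, Pierre Gruet

AI summary: For the European Frequency Containment Reserve (FCR) market, where bidders observe only partial feedback such as the clearing price and awarded quantity, recasts the multi-country FCR clearing problem as a repeated multi-unit uniform-price auction and applies a Best-of-Both-Worlds combinatorial semi-bandit algorithm that achieves logarithmic pseudo-regret in stochastic environments and square-root regret in adversarial ones; experiments confirm the theoretical scaling and competitive practical performance.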

Comments Algorithms and data available at https://data.mendeley.com/datasets/htprbf47dg/1

English abstract

Bidding in the European Frequency Containment Reserve (FCR) market is challenging for flexibility providers because competing offers are hidden and bidders observe only partial feedback from the market, such as the clearing price and awarded quantity. For a participant active in a single country, we show that the multi-country FCR clearing problem can be recast as a repeated multi-unit uniform-price auction against an endogenous vector of opposing bids. This reformulation yields an online learning problem and allows us to adapt a Best-of-Both-Worlds combinatorial semi-bandit algorithm implementable from this standard market feedback. The resulting bidder achieves logarithmic pseudo-regret in stochastic environments and $\mathcal{O}(\sqrt{T})$ regret in adversarial ones. Synthetic experiments confirm the expected scaling, and backtests on historical European FCR data show competitive performance in practice: the method performs especially well on stable products, while EXP3-type baselines can be safer under stronger non-stationarity. Overall, the results show that learning-based bidding in FCR markets is theoretically grounded and practically useful when the learning rule matches product-level market stability.
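
One round of a multi-unit uniform-price procurement auction, and the partial feedback a bidder sees from it, can be sketched as follows. The clearing rule here is heavily simplified (divisible offers, a single zone, the marginal accepted offer sets the price) and is an assumption for illustration, not the actual multi-country FCR clearing; bidder names and numbers are made up.

```python
def clear_uniform_price(offers, demand):
    """Accept the cheapest offers until `demand` units are procured.

    offers: list of (bidder, price, quantity) tuples.
    Returns (clearing_price, awarded), where every accepted unit is paid
    the price of the last (marginal) accepted offer.
    """
    awarded = {}
    clearing_price = None
    remaining = demand
    for bidder, price, quantity in sorted(offers, key=lambda o: o[1]):
        if remaining <= 0:
            break
        take = min(quantity, remaining)
        awarded[bidder] = awarded.get(bidder, 0) + take
        remaining -= take
        clearing_price = price
    return clearing_price, awarded

offers = [("learner", 12.0, 5), ("rival_a", 10.0, 6), ("rival_b", 15.0, 10)]
price, awarded = clear_uniform_price(offers, demand=10)

# What the learner observes: the clearing price and its own awarded quantity,
# but not the rivals' offers.
learner_feedback = (price, awarded.get("learner", 0))
```

Repeating this round with unknown rival offers is what turns bidding into an online learning problem: the learner must choose its next price and quantity from `learner_feedback` alone.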

2605.31069 2026-06-01 cs.CV cs.CL

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

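Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

AI summary: Proposes the VISTA framework, which performs long-video event prediction through multi-level event semantics mining (detail level, event level, future level), addressing existing models' inability to extract event details precisely and analyze event development at a fine grain.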

English abstract

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

2605.31068 2026-06-01 cs.CV

HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning

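Md Aminur Hossain, Ayush V. Patel, Sanjay K. Singh, Biplab Banerjee

AI summary: Proposes HQ-JEPA, a hybrid quantum-classical architecture that learns semantic representations from Sentinel-1/2 imagery via joint-embedding prediction, cross-modal alignment, SIGReg Gaussian regularization, and a quantum fidelity loss, achieving competitive and often superior performance relative to strong baselines on GeoBench classification and segmentation tasks.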

Comments 19 pages

English abstract

We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA combines four complementary objectives: latent token prediction, cross-modal token alignment, SIGReg-based Gaussian regularization in the fused latent space, and a differentiable SWAP-test-based Fidelity Quantum Similarity (FQS) loss. Unlike pixel reconstruction methods, HQ-JEPA learns semantic representations directly in latent space and uses quantum state-overlap-based similarity as an additional regularization signal. We evaluate the pretrained encoder on GeoBench classification and segmentation tasks under linear probing and fine-tuning settings. Results show that HQ-JEPA achieves competitive and often superior performance over strong self-supervised and remote sensing foundation-model baselines, demonstrating the benefit of integrating predictive self-supervision, cross-modal geometric regularization, and quantum fidelity-based representation learning for remote sensing applications.
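
The quantity a SWAP test estimates can be simulated classically in a few lines. For two pure states the ancilla is measured as 0 with probability 1/2 + 1/2 |⟨a|b⟩|², so the state overlap (fidelity) can be read off from that probability. How HQ-JEPA encodes embeddings into states and shapes the FQS loss is not specified in the abstract; the function name and the `1 - fidelity` loss mentioned afterwards are hypothetical.

```python
import numpy as np

def swap_test_p0(a, b):
    """Return (P(ancilla = 0), fidelity) for two amplitude vectors,
    after normalizing each to a unit-norm state."""
    a = np.asarray(a, dtype=complex)
    b = np.asarray(b, dtype=complex)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    fidelity = float(abs(np.vdot(a, b)) ** 2)   # |<a|b>|^2
    return 0.5 + 0.5 * fidelity, fidelity

same = swap_test_p0([1.0, 0.0], [2.0, 0.0])        # identical states after normalization
orthogonal = swap_test_p0([1.0, 0.0], [0.0, 1.0])  # no overlap
```

A similarity loss built on this, for example `1 - fidelity` between paired embeddings, would be differentiable in the classical simulation, which is presumably what makes it usable as a regularization signal.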

2605.31066 2026-06-01 cs.RO

Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

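Tianle Zeng, Yanci Wen, Xueang Yu, Hong Zhang

AI summary: Builds the CARLA-Air simulation environment to evaluate aerial vision-language-action models on air-ground cooperation tasks, finds that current models struggle to turn single-agent competence into stable cooperative behavior, and identifies three components needed for zero-shot cooperation: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment.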

Comments Code at https://github.com/louiszengCN/CarlaAir

English abstract

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV--UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.

2605.31062 2026-06-01 cs.CL

AdaptR1: Reinforcement Learning Based Adaptive Interleaved Thinking in Multi-hop Question Answering

AdaptR1:基于强化学习的自适应交错思考在多跳问答中的应用

Yuxin Wang, Jiahao Lu, Qifeng Wu, Shicheng Fang, Chuanyuan Tan, Yining Zheng, Xuanjing Huang, Xipeng Qiu

AI总结 提出AdaptR1框架,通过强化学习动态分配每步推理预算,减少多跳问答中的过度思考,在保持性能的同时显著降低推理成本。

详情
AI中文摘要

大型语言模型(LLMs)通过思维链(CoT)提示在复杂推理任务中取得了显著性能。然而,这种方法常常导致“过度思考”,即模型为简单查询生成不必要长的推理轨迹,并产生可避免的推理成本。虽然最近的工作探索了自适应推理,但现有方法通常对是否进行推理做出单一的查询级决策。这忽略了多步任务的动态性质,其中显式推理的需求在中间阶段会有所不同。为了解决这一限制,我们引入了AdaptR1,一种基于强化学习(RL)的框架,用于多跳问答(QA)中的自适应交错思考。与需要监督微调(SFT)进行冷启动初始化的先前方法不同,AdaptR1使用完全基于RL的策略,并带有质量门控效率奖励,以动态分配每一步的推理预算。在Graph-R1设置下,AdaptR1将平均思考令牌减少了69.71%,在HotpotQA上减少了90.35%,同时保持与标准基线相当或更好的性能。此外,我们的分析揭示,多跳推理中的过度思考并非均匀分布,而是主要发生在初始规划阶段,这突显了逐步自适应预算分配的有效性。

英文摘要

Large Language Models (LLMs) have achieved remarkable performance in complex reasoning tasks through Chain-of-Thought (CoT) prompting. However, this approach often leads to "over-thinking," where models generate unnecessarily long reasoning traces for simple queries and incur avoidable inference cost. While recent work has explored adaptive reasoning, existing methods typically make a single query-level decision about whether to reason. This overlooks the dynamic nature of multi-step tasks, where the need for explicit reasoning varies across intermediate stages. To address this limitation, we introduce AdaptR1, a Reinforcement Learning (RL) based framework for adaptive interleaved thinking in multi-hop Question Answering (QA). Unlike previous approaches that require Supervised Fine-Tuning (SFT) for cold-start initialization, AdaptR1 uses a fully RL-based strategy with a quality-gated efficiency reward to dynamically allocate reasoning budgets at each step. Under the Graph-R1 setting, AdaptR1 reduces average think tokens by 69.71%, with a 90.35% reduction on HotpotQA, while maintaining performance comparable to or better than standard baselines. Furthermore, our analysis reveals that overthinking in multi-hop reasoning is not uniformly distributed but occurs predominantly during the initial planning stages, highlighting the effectiveness of step-wise adaptive budget allocation.
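
As a rough illustration of what a quality-gated efficiency reward could look like, the sketch below adds a token-saving bonus only when the answer clears a quality threshold. The function name, the threshold, the token budget, and the bonus weight are all hypothetical; the abstract does not specify the reward actually used by AdaptR1.

```python
def quality_gated_reward(answer_score: float,
                         think_tokens: int,
                         token_budget: int = 512,
                         quality_threshold: float = 0.5,
                         efficiency_weight: float = 0.2) -> float:
    """Hypothetical reward: task score plus a token-saving bonus gated on quality."""
    if not 0.0 <= answer_score <= 1.0:
        raise ValueError("answer_score must lie in [0, 1]")
    if think_tokens < 0 or token_budget <= 0:
        raise ValueError("think_tokens must be >= 0 and token_budget > 0")

    # Fraction of the (assumed) budget left unused, clipped at zero.
    saved = max(0.0, 1.0 - think_tokens / token_budget)

    # Gate: the efficiency bonus applies only if the answer is good enough,
    # so skipping reasoning that the step needs is never rewarded.
    gate = 1.0 if answer_score >= quality_threshold else 0.0
    return answer_score + gate * efficiency_weight * saved
```

Under these assumptions, a correct answer with no thinking scores above a correct answer that spends the full budget, while a wrong answer earns nothing regardless of how few tokens it used.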

2605.31061 2026-06-01 cs.LG cs.AI

STEP: Learning STructured Embeddings for Progressive Time Series

STEP:学习渐进时间序列的结构化嵌入

Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

AI总结 提出一种自监督对比学习方法,通过构建具有固定正交原型向量的低维流形几何结构,实现渐进时间序列的端状态预测、多步预测和可解释相位分离。

详情
AI中文摘要

我们提出了一种新颖的方法,用于学习渐进时间序列的可解释表示,即捕获不可逆状态转换(如退化或任务完成)的数据。我们的方法使用自监督对比目标来学习低维潜在空间,其几何结构本身就是解释:每个观测成为位于两个固定正交原型向量之间的流形上的一个点,轨迹成为穿过该流形的路径。从这种结构中,我们读取一个潜在指南针,即潜在向量的极坐标(θ, r),其中θ跟踪潜在状态的进展(例如,从健康到故障),r识别活动模式(例如,操作条件),无需任何代理标签。我们在不同领域(包括工业退化、机器人任务和神经活动)上评估了该方法与最先进方法的对比,验证了三个关键能力:(1)端状态预测,(2)多步预测,以及(3)可解释的相位分离。我们的方法在所有方面匹配或优于黑盒对应方法,同时提供对底层机制的透明性。在潜在指南针坐标之上的简单线性回归器与深度架构具有竞争力,这是底层状态以几何可访问形式编码的直接定量证据。

英文摘要

We present a novel method for learning interpretable representations of progressive time series, that is, data capturing irreversible state transitions such as degradation or task completion. Our approach uses a self-supervised contrastive objective to learn a low-dimensional latent space whose geometry is itself the interpretation: each observation becomes a point on a manifold anchored between two fixed orthogonal prototype vectors, and a trajectory becomes a path across that manifold. From this structure we read a latent compass, the polar coordinates (θ, r) of the latent vector, in which θ tracks the progression of the underlying state (e.g., from healthy to failed) and r identifies the active mode (e.g., the operating condition), without any proxy labels. We evaluate the approach against the state of the art on diverse domains, including industrial degradation, robotic tasks, and neural activity, validating three key capabilities: (1) end-state prediction, (2) multi-step forecasting, and (3) interpretable phase separation. Our method matches or improves over black-box counterparts on all of these while providing transparency about the underlying mechanisms. A simple linear regressor on top of the latent compass coordinates is competitive with deep architectures, direct quantitative evidence that the underlying state is encoded in a geometrically accessible form.
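
The latent compass readout can be sketched in a few lines, assuming the two prototype vectors are orthonormal and that θ is measured from the first prototype toward the second. The function and argument names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def latent_compass(z: Sequence[float],
                   p_start: Sequence[float],
                   p_end: Sequence[float]) -> Tuple[float, float]:
    """Return (theta, r) of z in the plane spanned by two orthonormal prototypes."""
    if not (len(z) == len(p_start) == len(p_end)):
        raise ValueError("all vectors must share the same dimension")

    # Project the latent vector onto each prototype direction.
    a = sum(zi * pi for zi, pi in zip(z, p_start))
    b = sum(zi * pi for zi, pi in zip(z, p_end))

    theta = math.atan2(b, a)  # 0 at p_start, pi/2 at p_end (assumed convention)
    r = math.hypot(a, b)      # radial coordinate
    return theta, r


# Example: a latent vector halfway between the two prototypes.
theta, r = latent_compass([1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because θ and r are plain scalars, a linear regressor on them (as the abstract describes) needs only two input features.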

2605.31058 2026-06-01 cs.CL cs.SE

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

组合合成:通过原子分解与重组扩展代码RLVR

Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

AI总结 提出原子分解与重组(ADR)框架,通过将代码任务分解为原子元素并受控重组,生成新颖且具有挑战性的可验证代码任务,以解决RLVR训练数据稀缺和扩展性问题,实验表明在多个下游领域显著提升代码能力。

Comments Work in progress

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)近期已成为塑造大型语言模型(LLMs)卓越编码能力的基石。然而,RLVR的可扩展性受到严重制约,因为缺乏足够具有挑战性的、针对模型能力边缘的可验证代码任务。先前的研究通常依赖启发式种子扩展进行数据合成,这严重限制了新颖性和难度。因此,此类数据的训练价值无法随合成规模成比例扩展。为此,我们提出原子分解与重组(ADR),一种通过将任务分解为原子元素并进行受控重组来生成可验证代码任务的新框架,从而能够生成真正新颖且具有挑战性的可验证代码任务。实验和分析表明,ADR在原创性、难度、多样性和测试质量方面优于现有基线,并在包括算法编程、工具使用和数据科学在内的多个下游领域的RLVR中持续带来更大的代码能力提升。我们的工作为新颖代码任务合成和可扩展的RLVR训练开辟了新范式。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as the cornerstone for shaping the remarkable coding abilities of Large Language Models (LLMs). However, the scalability of RLVR is severely constrained by the scarcity of sufficiently challenging verifiable code tasks that target near the model's edge of competence. Prior studies often rely on heuristic seed expansions for data synthesis, which severely limits both novelty and difficulty. Consequently, the training value of such data fails to scale proportionally with the size of its synthesis. To this end, we propose Atomic Decomposition and Recombination (ADR), a novel framework that generates verifiable code tasks via decomposition into atomic elements and controlled recombination, thereby enabling the generation of genuinely novel and challenging verifiable code tasks. Experiments and analysis demonstrate that ADR achieves superior originality, difficulty, diversity, and test quality over existing baselines, and consistently delivers greater improvements in code ability across RLVR in diverse downstream domains, including algorithmic programming, tool usage, and data science. Our work sheds light on a new paradigm for novel code task synthesis and scalable RLVR training.

2605.31057 2026-06-01 cs.CV cs.LG

LVSA: Training-Free Sparse Attention for Long Video Diffusion

LVSA:长视频扩散的无训练稀疏注意力

Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu

AI总结 提出一种无需训练、模型无关的块稀疏注意力方法LVSA,通过结构化窗口模式与旋转全局锚点结合,在降低长视频扩散推理计算成本的同时消除固定网格偏差,支持超训练时域的视频生成。

Comments 10 pages, 5 figures, 4 tables. Code: https://github.com/JiusiServe/LongVideoSparseAttention

详情
AI中文摘要

密集自注意力是长视频扩散推理的计算和质量的瓶颈:成本随序列长度二次增长,且超出训练时域时模型收敛到近乎静态的输出,即“冻结”的重复视频。最先进的方法要么成本过高(例如需要重新训练),要么无法以可扩展的方式同时满足性能和质量目标。为此,我们提出长视频稀疏注意力(LVSA),一种无需训练、模型无关的块稀疏注意力方法,用于视频扩散Transformer,它结合了结构化窗口模式与旋转全局锚点,从而消除了导致长时域伪影的固定网格偏差。LVSA结合FlashInfer内核,与密集注意力相比,在Wan 2.1 1.3B上以6倍时域减少计算量达3.17倍,在Wan 2.1 14B上以6倍时域减少2.98倍,在HunyuanVideo 1.5上以1.5倍时域减少3.33倍。除了减少计算量,LVSA还使得HunyuanVideo 1.5能够在2倍时域下生成,否则在单个GPU上会内存不足。此外,与RIFLEx相比,LVSA在Wan 2.1 1.3B上提供高达2.41倍的加速,与UltraViCo相比提供3.27倍的加速。为了展示跨不同平台的适用性,我们将LVSA应用于NPU,与密集注意力相比,在Wan 2.2 A14B上实现高达2.71倍的加速,在Wan 2.1 1.3B上实现3.24倍的加速。为了公平地评估质量,我们引入了VQeval,一个正确评分循环视频失败的工具,而VBench-Long等最先进评估器则会奖励这类失败。LVSA在训练时域长度下生成时质量中性,在扩展长度下质量积极。

英文摘要

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State-of-the-art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention method for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias that causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. Moreover, LVSA provides speedups of up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality fairly, we introduce VQeval, a tool that properly scores looping-video failures, which are instead rewarded by state-of-the-art evaluators like VBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.
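
To make the idea of a window pattern plus rotating global anchors concrete, here is a deliberately simplified one-dimensional sketch of a block mask. How LVSA actually lays out its windows and rotates its anchors is not stated in the abstract, so the anchor spacing, the `step` argument, and the function name are assumptions made purely for illustration.

```python
from typing import List


def block_sparse_mask(num_blocks: int,
                      window: int,
                      num_anchors: int,
                      step: int) -> List[List[bool]]:
    """Hypothetical block mask: local window plus anchors that shift with `step`."""
    if num_blocks <= 0 or window < 0 or num_anchors < 0:
        raise ValueError("invalid mask parameters")

    # Spread anchors evenly over the sequence, then rotate them by `step`
    # so that no fixed set of blocks is always globally visible.
    stride = max(1, num_blocks // max(1, num_anchors))
    anchors = {(step + k * stride) % num_blocks for k in range(num_anchors)}

    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for q in range(num_blocks):
        for k in range(num_blocks):
            in_window = abs(q - k) <= window
            mask[q][k] = in_window or k in anchors
    return mask


# Two consecutive steps expose different global anchor blocks.
mask_step0 = block_sparse_mask(num_blocks=8, window=1, num_anchors=2, step=0)
mask_step1 = block_sparse_mask(num_blocks=8, window=1, num_anchors=2, step=1)
```

In this toy version each query block attends to at most `2 * window + 1 + num_anchors` key blocks, so the number of active blocks grows linearly rather than quadratically with `num_blocks`.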

2605.31056 2026-06-01 cs.CL

How Much Do LLMs Know About Chinese Zero Pronouns?

LLMs 对中文零代词的了解程度如何?

Yifei Li, Guanyi Chen, Tingting He

AI总结 通过一系列语言学动机任务(识别、指称性分类、指称类型分类、消解和翻译),系统评估了大型语言模型处理中文零代词的能力,发现当前LLMs在零代词处理上仍面临巨大挑战,尤其在识别和指称性分类等上游任务上表现不佳。

详情
AI中文摘要

零代词(ZPs)是汉语等代语省略语言中普遍存在的语言现象,长期以来对自然语言处理系统构成挑战。尽管大型语言模型(LLMs)在许多中文任务上表现良好,但其处理ZPs的能力仍不清楚。我们通过一系列语言学动机任务(包括识别、指称性分类、指称类型分类、消解和翻译)对LLMs处理中文ZPs的能力进行了系统调查。评估了多种LLMs在所有任务上的表现。结果表明,中文ZPs对当前LLMs仍然高度具有挑战性,尤其是在识别和指称性分类等上游任务上。下游任务(如ZP翻译)的表现也持续较低:即使是最先进的推理型LLMs,也未能将超过一半的中文ZPs正确翻译成英语。

英文摘要

Zero Pronouns (ZPs) are a pervasive linguistic phenomenon in pro-drop languages such as Chinese and have long posed a challenge for natural language processing systems. Although Large Language Models (LLMs) perform well on many Chinese language tasks, their ability to process ZPs remains poorly understood. We conduct a systematic investigation of LLMs' handling of Chinese ZPs through a sequence of linguistically motivated tasks, including identification, referentiality classification, referential type classification, resolution, and translation. A diverse set of LLMs is evaluated across all tasks. Our results show that Chinese ZPs remain highly challenging for current LLMs, particularly for upstream tasks such as identification and referentiality classification. Performance on downstream tasks, such as ZP translation, is also consistently low: even state-of-the-art reasoning-oriented LLMs correctly translate fewer than half of Chinese ZPs into English.

2605.31053 2026-06-01 cs.SD cs.AI

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

AnchorSteer: 自发现概念注入用于结构保持的音乐编辑

Chih-Heng Chang, Keng-Seng Ho, Chih-Yu Tsai, Kuan-Lin Chen, Yi-Hsuan Yang, Jian-Jiun Ding

AI总结 提出AnchorSteer框架,通过结构锚定与自发现语义注入解耦语义-结构纠缠,实现高保真结构保持下的显著语义变换。

Comments Accepted by the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

可控音乐编辑旨在修改高级属性,同时严格保留节奏和旋律结构。然而,这一任务面临语义-结构纠缠的挑战:引导方法往往为了编辑性能而牺牲结构,而结构适配器则抑制语义响应。我们提出AnchorSteer,一个通过将结构锚定与自发现语义引导耦合来解耦这种张力的框架。该方法通过自监督重构目标探测内部表示,提取可解释、无标签的概念向量,无需精心策划的数据即可隔离属性。在编辑过程中,这些便携、即插即用的概念向量被注入扩散隐空间,同时结构适配器强制执行一致性。提供了无条件和条件注入的变体,以平衡鲁棒性和语义强度。在ZoME-Bench和主观测试上的实验表明,所提出的框架优于纯引导和纯锚定的基线,实现了高保真结构保持下的显著语义变换。

英文摘要

Controllable music editing aims to modify high-level attributes while strictly preserving rhythmic and melodic structures. However, this task is challenged by a semantic-structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that disentangles this tension by coupling structural anchoring with self-discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label-free concept vectors via a self-supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug-and-play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injections are provided to balance robustness and semantic strength. Experiments on ZoME-Bench and subjective tests show that the proposed framework outperforms both steering-only and anchoring-only baselines, enabling significant semantic transformations with high-fidelity structural preservation.

2605.31050 2026-06-01 cs.LG

Best-Arm Identification-Based Trust Region Selection for Bayesian Optimization on Multimodal Functions

基于最佳臂识别的多模态函数贝叶斯优化信任区域选择

Nobuo Namura, Sho Takemori

AI总结 提出一种结合最佳臂识别与信任区域贝叶斯优化的轨迹感知框架,通过预测局部优化器最终性能并逐步淘汰次优候选,加速多模态函数全局优化。

Comments 19 pages, 13 figures

详情
AI中文摘要

基于高斯过程的贝叶斯优化是昂贵的黑箱优化的流行方法,但其性能在复杂多模态或高维问题上常常下降。基于信任区域的贝叶斯优化通过聚焦局部区域缓解了这一问题,最近的研究表明,选择有效区域可以建模为多臂老虎机问题。我们提出了一种轨迹感知框架,将最佳臂识别与基于信任区域的贝叶斯优化相结合,以高效求解多模态优化问题。我们的方法外推多个局部初始化优化器的优化轨迹以预测其最终性能,并通过最佳臂识别逐步淘汰次优候选。我们从理论上证明,在温和假设下,所提出的最佳臂识别引导的贝叶斯优化比传统贝叶斯优化更快收敛到全局最优,并通过在合成和真实世界基准上的大量实验证明了其有效性。

英文摘要

Gaussian process-based Bayesian optimization (BO) is a popular approach for expensive black-box optimization, but its performance often degrades on complex multimodal or high-dimensional problems. Trust region-based BO mitigates this issue by focusing on local regions, and recent studies suggest that selecting an effective region can be formulated as a multi-armed bandit problem. We propose a trajectory-aware framework that integrates best-arm identification (BAI) with trust region-based BO to efficiently solve multimodal optimization problems. Our method extrapolates the optimization trajectories of multiple locally initialized optimizers to predict their final performance and progressively eliminates suboptimal candidates via BAI. We theoretically show that the proposed BAI-guided BO converges faster to the global optimum than conventional BO under mild assumptions, and demonstrate its effectiveness through extensive experiments on synthetic and real-world benchmarks.

2605.31049 2026-06-01 cs.LG cs.AI cs.LO

Learning to Solve and Optimize by Evolving Code

通过代码演化学习求解与优化

Veronika Semmelrock, Benedetta Strizzolo, Francesco Zuccato, Gerhard Friedrich, Patrick Rodler, Konstantin Schekotihin

AI总结 提出CHECKMATE工具,利用形式规范确保解的正确性并通过自然语言描述指导代码演化,自动生成算法,在配置与调度问题上超越最先进求解器。

Comments Preprint of a paper accepted to IJCAI26

详情
AI中文摘要

组合与优化问题是许多工业AI应用的基础。解决此类大规模现实世界实例通常需要仔细的问题形式化、专门的求解器以及专家设计的启发式方法。因此,专家不仅需要指定解是什么,还需要指定如何推导出解。通过引入工具CHECKMATE,我们展示了通过代码演化生成算法代表了一种范式转变,消除了制定如何的需求。CHECKMATE仅依赖于是什么。具体来说,形式规范确保了解的正确性,并能够对生成的程序进行系统性能评估,而自然语言描述则指导演化过程。我们的方法在两个工业领域(配置与调度)的选定问题上展示了有效性。在所有案例中,演化出的算法始终优于最先进的求解器。这凸显了形式方法在引导代码演化以自动解决复杂现实问题方面的潜力。

英文摘要

Combinatorial and optimization problems are fundamental to many industrial AI applications. Solving large-scale real-world instances of such problems typically requires careful problem formalization, specialized solvers, and expert-designed heuristics. Thus, experts need to specify not only what solutions are, but also how they are derived. By introducing the tool CHECKMATE, we show that algorithm generation via code evolution represents a paradigm shift by eliminating the need to formulate the how. CHECKMATE solely relies on the what. Specifically, a formal specification ensures solutions' correctness and enables systematic performance evaluation of the generated programs, while a natural language description guides the evolutionary process. The effectiveness of our method is demonstrated on selected problems from two industrial domains: configuration and scheduling. In all cases, the evolved algorithms consistently outperform state-of-the-art solvers. This underscores the potential of formal methods in guiding code evolution for automatically solving complex real-world problems.

2605.31048 2026-06-01 cs.CV

Rethinking Efficient Crack Segmentation with Task-Aligned Structural-Directional Modeling

重新思考基于任务对齐的结构-方向性建模的高效裂缝分割

Shipeng Liu, Liang Zhao, Dengfeng Chen, Weihua Zhang

AI总结 将裂缝分割视为稀疏结构恢复问题,提出RIFT模型,通过轻量多尺度融合保留局部证据、聚合方向连续性,在16项指标上达到最优或并列最优。

详情
AI中文摘要

最近的裂缝分割方法通常遵循通用的语义分割设计,使用更强的骨干网络、混合CNN-Transformer-Mamba编码器和辅助增强分支。虽然有效,但这引发了疑问:更强的通用特征混合是否是裂缝分割最合适的方向。相反,我们将裂缝分割表述为稀疏结构恢复。裂缝具有有限的类别级语义,但具有很强的形态规律性,即细、稀疏、各向异性、局部碎片化,且容易与纹理或阴影混淆。因此,关键瓶颈在于保留弱结构证据、恢复方向连续性以及抑制背景耦合。我们提出RIFT,一个紧凑的形态对齐裂缝分割模型家族。RIFT设计简单,而不是压缩复杂的通用架构,它保留局部证据,聚合协作方向连续性,并通过轻量多尺度融合恢复裂缝结构。在四个公共基准上的实验表明,RIFT在16个主要指标上对再现的代表性基线取得了最佳或并列最佳结果。RIFT-B提供了最强的整体精度,而RIFT-T提供了最佳的部署效率,仅0.47M参数和高推理速度。拓扑感知评估、消融实验、迁移实验和可视化进一步验证了,当其归纳偏置与裂缝形态匹配时,任务对齐的简单性可以匹配或超越复杂的混合架构。代码:https://github.com/xauat-liushipeng/RIFT

英文摘要

Recent crack segmentation methods often follow generic semantic segmentation designs, using stronger backbones, hybrid CNN-Transformer-Mamba encoders, and auxiliary enhancement branches. Although effective, this raises the question of whether stronger generic feature mixing is the most suitable direction for crack segmentation. We instead formulate crack segmentation as sparse structural recovery. Cracks have limited category-level semantics but strong morphological regularities, being thin, sparse, anisotropic, locally fragmented, and easily confused with textures or shadows. Thus, the key bottleneck lies in preserving weak structural evidence, recovering directional continuity, and suppressing background coupling. We propose RIFT, a compact family of morphology-aligned crack segmentation models. Rather than compressing a complex generic architecture, RIFT is simple by design, preserving local evidence, aggregating cooperative directional continuity, and restoring crack structures through lightweight multi-scale fusion. Experiments on four public benchmarks show that RIFT achieves the best or tied-best results across the 16 main metrics against reproduced representative baselines. RIFT-B gives the strongest overall accuracy, while RIFT-T provides the best deployment efficiency with only 0.47M parameters and high inference speed. Topology-aware evaluation, ablations, transfer experiments, and visualizations further verify that task-aligned simplicity can match or surpass complex hybrid architectures when its inductive bias fits crack morphology. Code: https://github.com/xauat-liushipeng/RIFT

2605.31044 2026-06-01 cs.LG

The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

使用强化学习控制工业能源系统的挑战

Tobias Lademann, Théo Vincent, Jan Peters, Matthias Weigold

AI总结 本文以热力供暖网络为例,研究强化学习在真实工业能源系统部署中的挑战,包括部分可观测性、动作空间设计、奖励设计及仿真到现实的差距,并基于实际部署发现强化学习虽能实现运行稳定性但存在性能差距。

Comments Submitted to Finding the Frame Workshop at RLC 2026

详情
AI中文摘要

强化学习在优化工业能源系统控制方面显示出有希望的结果,然而现有研究大多局限于仿真环境中的应用。我们以热力供暖网络为例,研究了在真实工业能源系统中部署强化学习的挑战。我们将任务形式化为马尔可夫决策过程,并沿着形式化描述的结构系统分析了相关挑战,包括部分可观测性、动作空间设计、奖励设计以及仿真到现实的差距。这些挑战基于现有的真实部署,其中强化学习实现了运行稳定性,但与仿真相比表现出显著的性能差距。

英文摘要

Reinforcement learning has shown promising results for optimizing the control of industrial energy systems, yet most existing studies remain limited to the application in simulation environments. We investigate the challenges of deploying reinforcement learning in a real-world industrial energy system, considering a thermal heating network as a use case. We formulate the task as a Markov Decision Process and systematically analyze the associated challenges along the structure of the formal description, including partial observability, action space design, reward design, and the simulation-to-reality gap. The challenges are grounded in an existing real-world deployment, where reinforcement learning achieves operational stability but shows a significant performance gap compared to simulation.

2605.31041 2026-06-01 cs.CV cs.AI

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

视觉信息在视觉-语言-动作模型驾驶行为中是否起决定性作用?

Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang, Xinhu Zheng

AI总结 本文提出结构化多级视觉扰动框架,系统分析VLA驾驶模型对视觉信息的依赖程度,揭示依赖模式随评估方式变化且在不同抽象层次上不均匀。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在自动驾驶中展现出令人期待的能力,凸显了统一多模态架构联合建模感知与规划的潜力。然而,当前基于VLA的驾驶行为如何植根于视觉信息仍知之甚少。现有评估协议主要关注聚合性能指标,缺乏结构化和实用的诊断方法来量化视觉-行为依赖性。在这项工作中,我们引入了一个结构化的多级视觉扰动框架,以系统分析基于VLA的驾驶模型中的视觉-行为依赖性。该框架沿着三个互补维度组织受控视觉扰动:通道级退化、信息级破坏和结构级修改。我们将其应用于基于VLA的驾驶系统,并在开环轨迹预测和交互式闭环安全评估下评估行为响应。实验揭示了依赖于评估的依赖模式以及跨抽象层次的不均匀视觉基础。这些发现呼吁对VLA驾驶模型进行更结构化的分析和原则性设计,以更好地理解视觉信息如何塑造行为,并开发更安全、更鲁棒的系统。

英文摘要

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. In this work, we introduce a structured multi-level visual perturbation framework to analyze visual-behavior dependency in VLA-based driving models systematically. The framework organizes controlled visual perturbations along three complementary dimensions: channel-level degradation, information-level disruption, and structure-level modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.

2605.31040 2026-06-01 cs.LG

UniRTL: Unifying Code and Graph for Robust RTL Representation Learning

UniRTL:统一代码和图以实现稳健的RTL表示学习

Yi Liu, Hongji Zhang, Lei Chen, Mingxuan Yuan, Qiang Xu

AI总结 提出UniRTL多模态预训练框架,通过互掩码建模和分层训练策略联合利用RTL代码与控制数据流图,实现细粒度对齐,在性能预测和代码检索任务上优于现有方法。

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

为寄存器传输级(RTL)设计开发有效的表示对于加速硬件设计工作流至关重要。然而,现有方法通常依赖于单一数据模态,即RTL代码或其相关的基于图的表示,限制了所学表示的表达能力和泛化能力。对于RTL,控制数据流图(CDFG)提供了保留完整信息的全面结构表示,而代码模态显式编码了语义和功能信息。我们认为,整合这些互补模态对于全面理解RTL设计至关重要。为此,我们提出UniRTL,一种多模态预训练框架,通过联合利用代码和CDFG学习统一的RTL表示。UniRTL通过互掩码建模实现代码和图之间的细粒度对齐,并采用分层训练策略,该策略结合了预训练的图感知分词器以及在图集成之前对文本(即功能摘要)和代码进行分阶段对齐。我们在两种下游任务(性能预测和代码检索)的多种设置下评估UniRTL。实验结果表明,UniRTL始终优于先前的方法,使其成为推进硬件设计自动化的更稳健和更强大的基础。

英文摘要

Developing effective representations for register transfer level (RTL) designs is crucial for accelerating the hardware design workflow. Existing approaches, however, typically rely on a single data modality, either the RTL code or its associated graph-based representation, limiting the expressiveness and generalization ability of the learned representations. For RTL, the control data flow graph (CDFG) offers a comprehensive structural representation that preserves complete information, while the code modality explicitly encodes semantic and functional information. We argue that integrating these complementary modalities is essential for a thorough understanding of RTL designs. To this end, we propose UniRTL, a multimodal pretraining framework that learns unified RTL representations by jointly leveraging code and CDFG. UniRTL achieves fine-grained alignment between code and graph through mutual masked modeling and employs a hierarchical training strategy that incorporates a pretrained graph-aware tokenizer and staged alignment of text (i.e., functional summary) and code prior to graph integration. We evaluate UniRTL on two downstream tasks, performance prediction and code retrieval, under multiple settings. Experimental results show that UniRTL consistently outperforms prior methods, establishing it as a more robust and powerful foundation for advancing hardware design automation.

2605.31034 2026-06-01 cs.LG cs.AI

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

多臂贝叶斯老虎机中的退火Softmax贪婪算法

William Overman, Mohsen Bayati

AI总结 本文研究退火Softmax贪婪算法在多臂贝叶斯伯努利老虎机中的贝叶斯遗憾,证明在先验满足线性上尾条件(β=1的β正则性)时,算法达到接近最优的贝叶斯遗憾率,并与RLVR方法形成结构类比。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)和基于组的策略优化方法(如GRPO)通过为每个提示采样多个完成并增加策略在奖励较高的完成上的概率来更新随机策略,同时通过KL惩罚向参考策略正则化。这些更新不包括追踪认知不确定性的显式机制。本文研究为何这种不确定性无关的更新仍然有效的一个风格化解释。我们分析了一个退火softmax(玻尔兹曼)策略,该策略在多臂贝叶斯伯努利老虎机中根据经验平均奖励的softmax选择动作。在先验满足线性上尾条件(β正则性的β=1情况)下,该条件意味着存在大量接近最优的臂,我们证明退火softmax贪婪算法实现了贝叶斯遗憾$\tilde{O}(m + T/m)$,特别地,当臂数$m = Θ(\sqrt{T})$时,遗憾为$\tilde{O}(\sqrt{T})$。这是该机制下接近最优的贝叶斯遗憾率,经验平均贪婪算法也能达到。在β正则性下,许多臂在整个学习过程中保持经验均值接近最优,因此当softmax采样一个非经验最优的臂时,该臂往往是另一个接近最优的臂,而不是明显较差的臂。相比之下,当臂数较少时,同类的softmax策略可能遭受线性遗憾。该结果也为RLVR提供了结构类比,其中以非可忽略概率产生正确完成的基础策略扮演了β正则性的角色。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit. Under a linear upper-tail condition on the prior (the $β=1$ case of $β$-regularity), which implies an abundance of near-optimal arms, we prove that annealed softmax greedy achieves Bayes regret $\tilde{O}(m + T/m)$, and in particular $\tilde{O}(\sqrt{T})$ when the number of arms scales as $m = Θ(\sqrt{T})$. This is the near-optimal Bayes regret rate in this regime, attained also by empirical-mean greedy. Under $β$-regularity, many arms maintain empirical means close to the optimum throughout learning, so when softmax samples an arm other than the empirically best, that arm tends to be another near-optimal one rather than a clearly inferior one. By contrast, with a small number of arms, the same kind of softmax policy can suffer linear regret. The result also provides a structural analogy to RLVR, where a base policy with a non-negligible probability of producing a correct completion plays the role of $β$-regularity.
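
The policy analyzed here can be simulated with a short sketch: arms are Bernoulli, the policy samples from a softmax over empirical means, and the temperature shrinks over time. The uniform prior (which has a linear upper tail), the `1/sqrt(t)` annealing schedule, and the optimistic value assigned to unpulled arms are assumptions chosen for illustration; the abstract does not state the schedule or initialization used in the paper.

```python
import math
import random
from typing import List


def softmax_probs(means: List[float], temperature: float) -> List[float]:
    """Boltzmann distribution over arms given their empirical means."""
    top = max(means)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((m - top) / temperature) for m in means]
    total = sum(weights)
    return [w / total for w in weights]


def run_annealed_softmax(num_arms: int, horizon: int, seed: int = 0) -> float:
    """Simulate one Bernoulli bandit run and return the cumulative pseudo-regret."""
    rng = random.Random(seed)
    # Arm means drawn from a uniform prior (assumed example of the beta = 1 case).
    true_means = [rng.random() for _ in range(num_arms)]
    pulls = [0] * num_arms
    successes = [0] * num_arms
    best = max(true_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        # Unpulled arms get an optimistic empirical mean of 1.0 (assumption).
        means = [successes[i] / pulls[i] if pulls[i] else 1.0
                 for i in range(num_arms)]
        temperature = 1.0 / math.sqrt(t)  # hypothetical annealing schedule
        probs = softmax_probs(means, temperature)
        arm = rng.choices(range(num_arms), weights=probs, k=1)[0]
        reward = 1 if rng.random() < true_means[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
        regret += best - true_means[arm]
    return regret
```

Calling `run_annealed_softmax(num_arms=m, horizon=T)` for several values of `m` at a fixed `T` gives a quick empirical feel for the few-arm versus many-arm contrast the abstract describes, though a single seeded run is no substitute for the paper's analysis.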

2605.31033 2026-06-01 cs.CV

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

SlotMemory: 面向流式长视频生成的以对象为中心的KV记忆

Weijia Dou, Hui Li, Jiahao Cui, Lei Zhou, Jingdong Wang, Siyu Zhu

AI总结 提出SlotMemory,一种以对象为中心的键值记忆机制,通过将变换器的键值流形分解为离散语义槽,实现实体级持久性和提示感知检索,在60秒交互叙事中动态一致性相对提升22.8%。

详情
AI中文摘要

流式视频生成模型通常依赖于以时间为中心的记忆,将历史上下文组织为原始帧、片段或未聚类的令牌。这种组织方式常导致实体离开画面或交互式提示转换时出现身份漂移和语义不一致。为解决这些限制,我们提出SlotMemory,一种用于流式视频扩散的以对象为中心的键值记忆机制。我们的方法通过将变换器的键值流形分解为离散、可重用的语义槽,将记忆抽象从事件发生的“何时”转移到所表示的“什么”。通过利用这些槽作为路由地址来索引和存储高保真键值令牌,我们实现了跨长时域的实体级持久性和提示感知检索。在使用Wan2.1-T2V-1.3B骨干网络对60秒交互叙事进行评估时,SlotMemory达到了81.61的最先进质量分数,并在动态一致性上比现有最强流式基线相对提升22.8%。我们的结果表明,结构化的语义表示,而非原始时间容量,是持久长视频合成的关键原语。我们的代码和检查点可在https://tj12323.github.io/SlotMemory/获取。

英文摘要

Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic inconsistency when entities exit the frame or during interactive prompt transitions. To address these limitations, we propose SlotMemory, an object-centric Key-Value memory mechanism for streaming video diffusion. Our approach shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete, reusable semantic slots. By utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, we enable entity-level persistence and prompt-aware retrieval across long horizons. Evaluated on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone, SlotMemory achieves a state-of-the-art quality score of 81.61 and a 22.8 percent relative improvement in dynamic consistency over the strongest existing streaming baseline. Our results demonstrate that structured semantic representation, rather than raw temporal capacity, is the essential primitive for persistent long-form video synthesis. Our code and checkpoints are available at https://tj12323.github.io/SlotMemory/.