arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01717 2026-06-02 cs.LG

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

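Minsik Choi, Geewook Kim

Affiliation: Korea University

AI summary: Proposes MERIT, which performs decentralized instruction tuning through conflict-aware splitting and weight merging, raising the 8-benchmark average on Qwen2.5-VL-3B from 54.3 to 57.0.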

Comments 32 pages, 5 figures. Accepted for publication at ICML 2026

Abstract

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.
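The final merge step described above (token-weighted averaging of independently fine-tuned partitions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Params` layout, the function name `merge_token_weighted`, and the toy numbers are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def merge_token_weighted(checkpoints: Sequence[Params], token_counts: Sequence[int]) -> Params:
    """Average parameters across partitions, weighting each by its training-token count."""
    if not checkpoints or len(checkpoints) != len(token_counts):
        raise ValueError("need one token count per checkpoint")
    total = float(sum(token_counts))
    if total <= 0:
        raise ValueError("token counts must sum to a positive number")
    weights = [n / total for n in token_counts]

    merged: Params = {}
    for name, first in checkpoints[0].items():
        acc = [0.0] * len(first)
        for w, ckpt in zip(weights, checkpoints):
            for i, value in enumerate(ckpt[name]):
                acc[i] += w * value
        merged[name] = acc
    return merged


# Two hypothetical partitions; the second was trained on three times as many tokens.
merged = merge_token_weighted(
    [{"layer.w": [0.0, 4.0]}, {"layer.w": [4.0, 0.0]}],
    token_counts=[1000, 3000],
)
```

In this toy case the merged weights sit three quarters of the way toward the second checkpoint, since it accounts for 75% of the tokens.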

2606.01713 2026-06-02 cs.RO cs.SY eess.SY math.OC

FlipItRight: Stable Pose-Targeted Throw-Flip Across Diverse Objects

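Axel Dawne, Shinkyu Park

Affiliation: King Abdullah University of Science and Technology

AI summary: Proposes the FlipItRight framework, which decomposes the task into an object-level planner and a robot-level planner and uses the release state as an explicit intermediate representation, enabling stable planar pose-targeted throw-flip with a high-DoF manipulator without prior data or learned models. It achieves a 90% success rate over 120 trials.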

Abstract

We propose FlipItRight, a framework for stable planar pose-targeted throw-flip with a high-DoF manipulator. The task is decomposed into an object-level planner, which generates candidate release states satisfying the desired landing pose, and a robot-level planner, which evaluates executability and constructs a feasible swing motion. Treating the release state as an explicit intermediate representation enables principled candidate filtering, adaptive selection of release and pre-swing configurations, and structured near-release motion design -- in particular, approximately constant end-effector velocities during the final swing phase to improve robustness to release-timing uncertainty. We validate on a real platform across objects of varying shape, size, and mass, achieving a 90% success rate across 120 trials. Ablation studies confirm that each design choice contributes to throwing performance, and the framework requires no prior data or learned model, enabling direct deployment on new objects and targets without environment-specific calibration or data collection.

2606.01710 2026-06-02 cs.CV cs.LG

Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs

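Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani

Affiliation: School of Computing and Information Systems, The University of Melbourne, Victoria, Australia

AI summary: Proposes Density-Aware Translation (DAT), which corrects image-text similarity with a local geometric density term to mitigate the performance degradation that spurious correlations cause in zero-shot classification with vision-language models such as CLIP.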

Comments ICML 2026

Abstract

Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

2606.01708 2026-06-02 cs.LG cs.AI

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

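Peter Chen, Xi Chen

Affiliations: Department of Mathematics, Columbia University; Stern School of Business, New York University

AI summary: For fixed-confidence best-action identification in stochastic minimax trees, proposes 2FFS, a two-fidelity tree-search algorithm that combines minimax-style fast expansion with MCTS-style stochastic sampling and adaptively chooses between cheap biased evaluations and expensive accurate ones. The paper proves fixed-confidence correctness, finite stopping, and a polynomial-depth cost upper bound, and experiments show substantially fewer samples and less computation than existing BAI-MCTS baselines.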

Comments 36 pages

Abstract

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with long language-model rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations than the existing BAI-MCTS baseline.

2606.01703 2026-06-02 cs.SD cs.AI cs.CV

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

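Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

Affiliation: Jen Music AI

AI summary: Proposes the JenBridge framework, which combines a Transformer-based generative model, dual text-visual conditioning, and an adaptive transition mechanism driven by an LLM agent to generate high-fidelity soundtracks for long-form video with natural, coherent scene transitions.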

Abstract

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

2606.01701 2026-06-02 cs.CV

Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

Affiliations: Institute of Digital Media, Department of Electronics Engineering and Computer Science, Peking University; Information Technology R&D Innovation Center of Peking University; Peng Cheng Laboratory; School of Computer Science and Technology, University of Chinese Academy of Sciences

AI summary: To address the high overhead of geometric partitioning in VVC, proposes a spatio-temporal correlation guided geometric partitioning (STGEO) scheme that reduces side-information bits through mode prediction and motion candidate selection, improving coding efficiency.

Journal ref IEEE Transactions on Image Processing, vol. 31, pp. 30-42, 2022

Abstract

Geometric partitioning has attracted increasing attention for its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) imposes a non-negligible burden for signaling the side information, which limits coding efficiency. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method reduces the bits consumed for side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above-mentioned side information. The main idea is to predict the STGEO modes and motion candidates that are more likely to be selected, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.

2606.01700 2026-06-02 cs.CV

MixerSENet: A Lightweight Framework for Efficient Hyperspectral Image Classification

Mohammed Q. Alkhatib, Swalpa Kumar Roy, Ali Jamali

Affiliations: College of Engineering and IT, University of Dubai; Department of Computer Science and Engineering, Alipurduar Government Engineering and Management College; Department of Geography, Simon Fraser University

AI summary: Proposes MixerSENet, a lightweight framework that decouples the mixing of spatial and channel dimensions and adds a squeeze-and-excitation block, achieving high accuracy and efficiency in hyperspectral image classification while keeping the parameter count low.

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

Abstract

In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters than traditional models, making it suitable for resource-constrained environments. A squeeze-and-excitation block is incorporated into the model to refine feature extraction, enhancing the network's ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on the Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and a low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at https://github.com/mqalkhatib/MixerSENet.
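For readers unfamiliar with the squeeze-and-excitation idea mentioned in the abstract, the generic mechanism can be sketched in plain Python as below. This is an illustration of a standard SE block under simplifying assumptions (no bias terms, hand-picked weights), not MixerSENet's actual layer; the function name and the toy shapes are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def squeeze_excite(features: List[Matrix], w_reduce: Matrix, w_expand: Matrix) -> List[Matrix]:
    """Re-weight channels of `features` (C maps of H x W) with a bottleneck gate.

    w_reduce has shape (C/r) x C and w_expand has shape C x (C/r); biases are
    omitted to keep the sketch short (an assumption of this example).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # Excite: linear -> ReLU -> linear -> sigmoid gives one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]


# Toy input: 2 channels, each a 1 x 2 map, with a bottleneck of width 1.
features = [[[1.0, 3.0]], [[2.0, 2.0]]]
out = squeeze_excite(features, w_reduce=[[0.5, 0.5]], w_expand=[[0.0], [1.0]])
```

Here the first channel receives a gate of exactly 0.5 (sigmoid of zero), while the second receives a larger gate, so the block amplifies the second channel relative to the first.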

2606.01698 2026-06-02 cs.CV

Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model

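Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu, Jing Qin, Lei Zhu

Affiliations: HKUST(GZ); Joy Future Academy; MBZUAI; Tsinghua University; Sichuan University; PolyU

AI summary: Proposes a semi-supervised hypergraph concept bottleneck model that uses dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels, achieving high interpretability and performance in medical image diagnosis tasks such as Placenta Accreta Spectrum.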

Abstract

Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.

2606.01695 2026-06-02 cs.LG

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

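Swapnil Parekh

Affiliation: Switzerland

AI summary: Proposes CANARY, which analyzes hidden-state differences through a sparse autoencoder to detect fine-tuning data contamination without labels, achieving AUROC = 1.000 at 1% contamination and supporting detection, verification, prioritization, and remediation.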

Abstract

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

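Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

Affiliation: Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA

AI summary: To address identity fragmentation in thermal pedestrian multi-object tracking, proposes a lightweight post-processing method that recovers identity continuity through online short-gap remapping and offline tracklet relinking, improving IDF1 on the PBVS Thermal Pedestrian MOT benchmark.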

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419

Abstract

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
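As a rough illustration of what offline tracklet relinking from temporal, spatial, and motion cues might look like, the sketch below greedily links a tracklet that ends to one that starts shortly afterwards near its extrapolated position. It is not the paper's method: the `Tracklet` fields, the constant-velocity extrapolation, the thresholds `max_gap` and `max_dist`, and the omission of border cues are all assumptions of this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Tracklet:
    track_id: int
    start_frame: int
    end_frame: int
    start_xy: Point  # box centre in the first frame
    end_xy: Point    # box centre in the last frame
    velocity: Point  # pixels per frame near the end of the tracklet


def relink(tracklets: List[Tracklet], max_gap: int = 30, max_dist: float = 50.0) -> Dict[int, int]:
    """Return a map from each track id to the id it should be merged into."""
    mapping = {t.track_id: t.track_id for t in tracklets}
    claimed = set()  # tracklets that already have a predecessor
    for prev in sorted(tracklets, key=lambda t: t.end_frame):
        best: Optional[Tracklet] = None
        best_dist = max_dist
        for nxt in tracklets:
            gap = nxt.start_frame - prev.end_frame
            # Temporal cue: the successor must start soon after prev ends.
            if nxt.track_id in claimed or not 0 < gap <= max_gap:
                continue
            # Motion cue: extrapolate prev's position across the gap.
            px = prev.end_xy[0] + prev.velocity[0] * gap
            py = prev.end_xy[1] + prev.velocity[1] * gap
            # Spatial cue: keep the closest candidate within max_dist.
            dist = math.hypot(nxt.start_xy[0] - px, nxt.start_xy[1] - py)
            if dist <= best_dist:
                best, best_dist = nxt, dist
        if best is not None:
            claimed.add(best.track_id)
            mapping[best.track_id] = mapping[prev.track_id]
    return mapping


# Toy scene: track 2 continues track 1 after a 5-frame gap; track 3 is far away.
a = Tracklet(1, 0, 10, (90.0, 100.0), (100.0, 100.0), (1.0, 0.0))
b = Tracklet(2, 15, 30, (105.0, 100.0), (120.0, 100.0), (1.0, 0.0))
c = Tracklet(3, 12, 20, (300.0, 300.0), (310.0, 300.0), (1.0, 0.0))
mapping = relink([a, b, c])
```

Processing tracklets in order of end frame means a predecessor is always resolved before its successor, so chains of fragments collapse onto the earliest identity. Tight thresholds keep the linking conservative, trading missed links for fewer wrong merges.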

2606.01689 2026-06-02 cs.CV cs.AI

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

Affiliations: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; College of Software, Jilin University; School of Geosciences, Yangtze University; College of Communication Engineering, Jilin University

AI summary: To address the difficulty mainstream state space models have in accurately modeling target edges in infrared small target detection, proposes the RPCASSM network based on robust principal component analysis (RPCA). It designs a background state space module (BSSM) and a target state space module (TSSM) that perform state space modeling using, respectively, the saliency of spatially heterogeneous signals and the sparsity and local highlight of targets.

Comments 12 pages, 8 figures, under review

Abstract

The detection and segmentation of infrared small targets are important for applications such as surveillance and security and maritime rescue. Because these targets occupy few pixels in long-distance imaging, mainstream visual state space models are inefficient and struggle to model target edges accurately. Existing infrared state space models do not depart from the mainstream visual state space framework to account for the structural properties of infrared small targets. To address this problem, this paper proposes the RPCASSM network, based on the model paradigm of robust principal component analysis (RPCA), which designs a background state space module (BSSM) and a target state space module (TSSM) according to the spatial-domain properties of infrared small targets. The BSSM uses the saliency of spatially heterogeneous signals to design a spatial probe scanning mechanism (SPCM) for modeling background information. The TSSM uses the sparsity and local highlight of the target to design a deformable prompt scanning mechanism (DPCM) that focuses state space modeling on the deformable space of the target. With this design, we effectively address the difficulty that existing mainstream visual state space models have in accurately modeling the edge structure of infrared small targets. Experimental results on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.
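For context, the classical RPCA formulation (principal component pursuit) that this kind of background/target split usually refers to is shown below. This is the textbook objective, stated here as background; the abstract does not confirm the exact form RPCASSM uses. The symbols are assumptions of this note: D is the observed image (or patch) matrix, L the low-rank background, S the sparse target component, and lambda > 0 a trade-off weight.

```latex
\min_{L,\,S}\; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad D = L + S
```

The nuclear norm encourages a low-rank background and the l1 norm encourages sparse targets, which loosely mirrors the separate background (BSSM) and target (TSSM) modules described in the abstract.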

2606.01686 2026-06-02 cs.SD cs.AI

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

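Seonghyeon Go, Yumin Kim

Affiliation: KAIST

AI summary: To address the limitation of current AI music detection to binary classification, proposes the HAIM dataset, defines the "AI Music Tracking" task through multi-stage labels, evaluates the shortcomings of existing detectors, and pushes the field toward granular, structured evaluation.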

Abstract

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

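Atoosa Chegini, Soheil Feizi

Affiliation: Department of Computer Science, University of Maryland

AI summary: Proposes Chunk-Level Guided Generation, which uses an off-the-shelf large language model as a process scorer. With fixed-length chunk scoring and a contrastive selection rule, it matches or outperforms PRM guided search on mathematical reasoning without any training.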

Abstract

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.
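The two selection rules in the abstract can be sketched as follows, given per-token log-probabilities for each candidate chunk. This is a hedged reading of the abstract, not the authors' code: the function names are hypothetical, and applying the same length normalization to the small-model term in CGS is an assumption of this example.

```python
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def select_lgs(large: List[List[float]]) -> int:
    """Likelihood-Guided Selection: pick the chunk the large model likes most."""
    scores = [mean_logprob(lp) for lp in large]
    return max(range(len(scores)), key=scores.__getitem__)


def select_cgs(large: List[List[float]], small: List[List[float]]) -> int:
    """Contrastive-Guided Selection: large-model score minus small-model score."""
    scores = [mean_logprob(l) - mean_logprob(s) for l, s in zip(large, small)]
    return max(range(len(scores)), key=scores.__getitem__)


# Three hypothetical candidate chunks of two tokens each.
large_lp = [[-0.2, -0.4], [-0.5, -0.5], [-1.0, -1.0]]
small_lp = [[-0.1, -0.1], [-2.0, -2.0], [-1.0, -1.2]]
lgs_choice = select_lgs(large_lp)
cgs_choice = select_cgs(large_lp, small_lp)
```

In this toy case LGS picks the first chunk, which both models already favour, while CGS picks the second, where the large model's preference diverges most from the small model's. In a real system the log-probabilities would come from a scoring forward pass of each model over the fixed-length chunk.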

2606.01679 2026-06-02 cs.CL

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin, Akiko Aizawa

Affiliations: The University of Tokyo; NII LLMC; National Institute of Informatics; University of Göttingen; Inria, LS2N, Nantes Université

AI summary: Using layer-wise linear probing and attention analysis, finds that when multimodal LLMs verify scientific claims, chart information is encoded but not routed to the prediction position, which produces the performance gap between tables and charts.

英文摘要

Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or extract it but fail to use it when forming their prediction. We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.

2606.01678 2026-06-02 cs.CL

Why Do Self-Harm Prediction Models Struggle to Generalise? Lexical and Semantic Variations in Emergency Department Triage Notes

为什么自伤预测模型难以泛化?急诊科分诊笔记中的词汇和语义变化

Liuliu Chen, Mike Conway, Jo Robinson, Vlada Rozova

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia(墨尔本大学计算与信息系) Orygen, The National Centre of Excellence in Youth Mental Health, Australia(青年心理健康国家卓越研究中心) Centre for Youth Mental Health, The University of Melbourne, Australia(青年心理健康中心) Centre for Digital Transformation of Health, The University of Melbourne, Australia(健康数字化转型中心)

AI总结 通过分析两家医院急诊科分诊笔记的词汇特征、预测特征和主题差异,发现机构间文档差异导致自伤检测模型跨站点性能下降,并提出改进泛化性的方法。

Comments Accepted to CLPsych2026

详情
AI中文摘要

急诊科(ED)的自伤表现与较高的自杀风险密切相关。NLP模型在单家医院的分诊笔记中检测自伤方面表现出稳健的性能,但跨机构时性能常常下降。为了探究潜在原因,我们通过分析词汇特征、高度相关的预测特征和显著主题,比较了两家医院的ED分诊笔记。我们的结果揭示了与自伤相关的词汇表达和特征重要性在不同医院之间存在差异,尽管核心主题如自我中毒和自我伤害是一致的。这些文档差异与跨站点性能下降相关。我们的发现揭示了机构差异如何影响临床文本中自伤的识别,并强调了改善模型泛化性的潜在方法。

英文摘要

Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown robust performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. Our findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.

2606.01677 2026-06-02 cs.SD

UniVocal: Unified Speech-Singing Code-Switching Synthesis

UniVocal: 统一语音-歌唱代码切换合成

Yufei Shi, Qian Chen, Wen Wang, Xiangang Li, Zhen-Hua Ling, Yang Ai

发表机构 * Tongyi Fun Team, Alibaba Group(通义Fun团队,阿里巴巴集团) Independent Researcher(独立研究者)

AI总结 提出UniVocal统一框架,通过两阶段课程学习和链式思维生成,隐式从文本上下文推断发声模式,实现语音-歌唱代码切换合成,在SCSBench上达到最优性能。

Comments accepted by ACL 2026

详情
AI中文摘要

我们提出UniVocal,一个统一框架,隐式地从文本上下文中推断发声模式,开创了语音-歌唱代码切换(SCS)合成任务——其中转换由文本语义自主驱动,类似于无缝的人类语言混合。与单模式生成或依赖切换控制标签的系统不同,我们提出的UniVocal仅从文本上下文隐式推断发声模式。为实现这一点,我们采用了一种数据高效的两阶段课程学习策略,逐步训练一个具有竞争力的TTS系统以获得所需的SCS能力。针对数据稀缺问题,我们引入了一个可扩展的流水线来合成多样化的代码切换数据,这些数据在语义和声学上都很自然,同时引入了一个新的多场景基准SCSBench。为了解决语义分词器在捕捉声学细节方面的局限性,我们还引入了精炼的cent token和链式思维(CoT)生成,在内容生成之前规划韵律,有效增强了共情语音生成和歌唱旋律。实验结果表明,UniVocal在SCSBench上达到了最先进的性能,同时在常规语音和歌唱任务上保持了竞争性能。音频样本可在https://project-univocal-demo.github.io/demo/获取。代码和数据集已在https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal发布。

英文摘要

We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis - a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. Addressing data scarcity, we introduce a scalable pipeline to synthesize diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address limitations of semantic tokenizers in capturing acoustic details, we also introduce refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, effectively enhancing empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.

2606.01672 2026-06-02 cs.LG

RDA: Reward Design Agent for Reinforcement Learning

RDA:用于强化学习的奖励设计智能体

Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind

AI总结 提出基于视觉语言模型的奖励设计智能体RDA,通过任务分解、视觉轨迹评估和失败模式总结迭代优化奖励函数,在操作任务中生成更符合指令的策略。

Comments Accepted to RLC'26

详情
AI中文摘要

强化学习已经能够获得令人印象深刻的机器人技能,但通常需要手工设计的奖励函数,这些函数设计缓慢且难以与人类意图对齐。最近的工作,如Eureka,通过使用LLM从任务描述中迭代生成和优化奖励代码来自动化奖励设计。然而,它们依赖于粗糙的反馈信号,如成功率,这些信号对学习到的行为提供的语义洞察很少。因此,它们训练的策略达到了最终目标,但经常与任务指令对齐不良。我们引入了奖励设计智能体(RDA),一个基于VLM的智能体框架,将语义理解注入奖励设计。RDA分解任务,视觉评估轨迹,总结失败模式,并迭代修订奖励代码以更好地与任务指令对齐。在ManiSkill的12个桌面操作任务和HumanoidBench的4个全身操作任务中,RDA产生的策略在指令对齐方面显著优于其他基线,同时实现了相当的任务成功率。视频和生成的奖励代码可在https://nitinkamra1992.github.io/reward-design-agent获取。

英文摘要

Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, such methods rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned behavior. As a result, the policies they train achieve the final goal but are frequently poorly aligned with task instructions. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that injects semantic understanding into reward design. RDA decomposes tasks, visually evaluates trajectories, summarizes failure modes, and iteratively revises reward code to better align with task instructions. Across 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA produces policies substantially more instruction-aligned than those of other baselines, while achieving comparable task success rates. Videos and the generated reward code are available at https://nitinkamra1992.github.io/reward-design-agent.

2606.01671 2026-06-02 cs.CL

When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models

当意义旅行:混合专家模型在语言模型习语理解中的细粒度作用

Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali, Kitsuchart Pasupa, Sriparna Saha

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Patna, India(印度理工学院巴特那分校计算机科学与工程系) School of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Thailand(泰国拉差班国王理工大学信息科技学院)

AI总结 针对低资源东南亚语言中习语的文化隐喻复杂性,提出Varnika多模态习语语料库和混合专家模型HybridMoE,通过控制混合和掩码多模态嵌入缓解专家稀疏性,实现习语理解性能提升5-6%。

详情
AI中文摘要

在多语言教育的当代,学习习语为通往创造力、文化价值观、历史背景以及各种语言传统固有的多元视角提供了迷人的途径。本文展示了在低资源东南亚语言(如印地语、孟加拉语和泰语)中保留比喻和文化语义的导航,这些语言中文化丰富的习语由于其深刻的隐喻复杂性,给计算建模和跨语言迁移带来了重大障碍。为应对这种复杂性,我们提出了Varnika,一个重建的多模态习语语料库,包含3,533个多语言习语,并丰富了与文本和视觉表示对齐的七种习语语气。此外,为了推断信息丰富的习语理解,我们引入了一个混合专家模型(HybridMoE)框架,该框架嵌入多个习语专家意见,同时通过受控混合整合来自选定和未选定专家的输出来缓解专家稀疏性,并通过掩码多模态嵌入进一步增强了习语属性信号。为了跨多个维度分析性能,我们提出了IDIO-TONE和习语验证分数,这是一个三阶段评估流程,衡量(i)字面翻译保真度,(ii)视觉语义对齐,以及(iii)习语意义保留。实证评估表明,HybridMoE在先进的视觉语言模型上实现了5-6%的性能提升,展示了在多语言多模态设置中比喻语言和文化嵌入意义的改进表示。

英文摘要

In the contemporary epoch of multilingual education, learning idioms provides a fascinating gateway towards creativity, cultural values, historical context, and diverse perspectives inherent to various linguistic traditions. This paper examines how to retain figurative and cultural semantics in low-resource Southeast Asian languages such as Hindi, Bengali, and Thai, where culturally rich idioms pose significant obstacles for computational modeling and cross-linguistic transfer due to their deep metaphorical complexity. To tackle such complexity, we present Varnika, a reconstructed multimodal idiom corpus comprising 3,533 multilingual idioms, enriched with seven idiomatic tones aligned with both textual and visual representations. Additionally, to infer informative idiomatic understanding, we introduce a Hybrid Mixture-of-Experts (HybridMoE) framework that embeds multiple idiomatic expert opinions while mitigating expert sparsity by integrating outputs from both selected and unselected experts through controlled hybridization, further augmented with Idiomatic Property Signals via masked multimodal embeddings. To analyze the performance across multiple dimensions, we propose the IDIO-TONE and Idiomatic Validation Score, a three-stage evaluation pipeline measuring (i) literal translation fidelity, (ii) visual-semantic alignment, and (iii) idiomatic meaning retention. Empirical evaluations highlight that HybridMoE achieves 5-6% performance gains across advanced vision language models, demonstrating improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings.

2606.01667 2026-06-02 cs.LG

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

ATLAS: 智能体测试时学习分配缩放

Peijia Qin, Qi Cao, Pengtao Xie

发表机构 * University of California, San Diego(加州大学圣迭戈分校)

AI总结 提出ATLAS框架,让LLM编排器通过单一

详情
AI中文摘要

测试时缩放已成为提升大语言模型推理能力的主要方式,但其编排仍由设计者工程化:固定的样本预算、固定的改进循环、固定的评分规则或固定的搜索策略决定了计算如何分配,模型负责求解而非编排。我们提出ATLAS,一种智能体测试时缩放框架,其中LLM编排器端到端地拥有控制循环。通过单一动作“探索”(在原问题上派发一个全新的独立求解器),编排器决定是否收集更多证据、何时停止以及如何综合最终答案;动作空间是可扩展的,每次探索调用可选地指定求解器、推理努力或提示策略。我们在四个基准上评估ATLAS,涵盖科学问答、代码生成和多模态推理,使用Claude Sonnet 4.6骨干网络,在HLE-Verified上达到56.00%,在LiveCodeBench上达到82.29%,在GPQA-Diamond上达到85.75%,在BabyVision上达到23.71%,同时使用的API调用远少于固定工作流基线。多模型扩展ATLAS-MM将求解器选择作为额外动作维度,进一步将HLE-Verified提升至60.00%,LiveCodeBench提升至85.63%,并在GPQA-Diamond和BabyVision上持续改进。将编排器的直接综合替换为独立整合器的消融实验在四个基准中的三个上降低或未能提高准确率,这与有状态证据管理在产生增益中的作用一致。

英文摘要

Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.

2606.01666 2026-06-02 cs.LG cs.AI

DOT-MoE: Differentiable Optimal Transport for MoEfication

DOT-MoE:用于MoE化的可微最优传输

Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出DOT-MoE框架,通过可微最优传输将密集层分解为专家,联合学习神经元分配和路由策略,在减少50%活跃参数的同时保留90%原始性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)的扩展带来了显著的性能提升,但也造成了推理效率方面的重大挑战。虽然混合专家(MoEs)架构通过将模型大小与推理成本解耦来解决这一问题,但从头训练MoEs通常不稳定且计算密集。将预训练的密集模型转换为稀疏MoEs已成为一种替代方案;然而,现有方法通常依赖启发式神经元聚类或随机分割来将前馈网络(FFN)划分为专家。在这项工作中,我们提出了DOT-MoE,一种新颖的框架,将密集层的分解建模为可微最优传输(DOT)问题。与静态启发式方法不同,我们将神经元分配建模为平衡传输问题,利用可微的Sinkhorn-Knopp迭代来强制执行严格的专家容量约束。此外,我们利用直通估计器(STE)来联合学习离散的神经元到专家的分配和令牌到专家的路由策略。跨多个架构和基准的大量实验表明,DOT-MoE显著优于结构化剪枝、启发式聚类和随机分割基线,在减少50%活跃参数的同时保留了原始密集模型90%的性能。

英文摘要

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
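
The balanced neuron-to-expert assignment mentioned in this abstract can be illustrated with a small Sinkhorn-Knopp sketch in plain Python. This is only a toy under stated assumptions: the affinity scores, the `temperature` value, and the function name `sinkhorn_balanced` are hypothetical, and the paper's discrete rounding, straight-through estimator, and router training are not reproduced here.

```python
import math
from typing import List


def sinkhorn_balanced(scores: List[List[float]],
                      n_iters: int = 100,
                      temperature: float = 0.5) -> List[List[float]]:
    """Soft assignment of n neurons to e experts.

    Alternately rescales rows and columns of exp(scores / temperature) so that
    each neuron's row sums to 1 and each expert's column sums to n / e
    (equal expert capacity).
    """
    n, e = len(scores), len(scores[0])
    capacity = n / e
    kernel = [[math.exp(s / temperature) for s in row] for row in scores]
    u = [1.0] * n  # row scaling factors
    v = [1.0] * e  # column scaling factors
    for _ in range(n_iters):
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(e)) for i in range(n)]
        v = [capacity / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(e)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(e)] for i in range(n)]


# Toy affinities (made up): three of four neurons prefer expert 0,
# but each expert may only hold two neurons' worth of mass.
scores = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
plan = sinkhorn_balanced(scores)
for row in plan:
    print([round(p, 3) for p in row])
```

Because every step is a differentiable rescaling, gradients can flow through the plan to whatever produces the scores, which is the property a differentiable optimal transport formulation relies on.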

2606.01665 2026-06-02 cs.LG

Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim

量化能量下限:基于sbsim的SAC HVAC控制中的直接测量与回放缓冲区偏差

Bo Li, Chen Zhang

发表机构 * Shanghai Jiao Tong University College of Smart Energy(上海交通大学智能能源学院)

AI总结 通过最小动作实验直接测量SAC HVAC控制中的能量下限,发现回放缓冲区初始化是次优性的主要来源,消除后可将成本降至接近下限。

Comments 5 pages, 3 figures, 2 tables. Presented at AI-DEEDS 2026 Workshop, ACM Sustainability Week, Banff, Canada (non-archival)

详情
AI中文摘要

我们在sbsim校准建筑模拟器上量化了Soft Actor-Critic (SAC) HVAC控制的能量下限——在动作空间约束下的最小可实现成本。通过最小动作实验,我们直接测量到该下限为35.51美元/天,其中连续电力负载占主导(35.44美元,99.8%),燃气消耗可忽略。标准SAC基线使用调度策略回放缓冲区过渡初始化,收敛到37.18美元/天,高于下限4.7%。我们确定缓冲区初始化是此场景中次优性的主要来源:从空缓冲区训练可将成本降至35.57美元/天,消除了96%的差距。将供水温度范围扩大10 K仅带来可忽略的额外节省(0.03美元/天),进一步扩大则触发物理约束违反。我们还发现一个折扣因子耦合(gamma_eff = 0.891),将有效规划视野从8.3小时缩小至46分钟——这是一个需要审计的基准广泛问题。在规划视野、奖励权重和观测增强上的系统消融实验证实,所有预填充缓冲区配置的聚类范围在0.7%以内(37.18–37.42美元),表明设备最小功率(而非算法设计)构成了约束性限制。

英文摘要

We quantify the energy floor -- the minimum achievable cost given action space constraints -- for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator. Through minimum-action experiments, we directly measure this floor at USD 35.51/day, dominated by continuous electrical loads (USD 35.44, 99.8%) with negligible gas consumption. The standard SAC baseline, initialized with schedule-policy replay buffer transitions, converges to USD 37.18/day, 4.7% above the floor. We identify buffer initialization as the dominant source of sub-optimality in this scenario: training from an empty buffer reduces cost to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible additional savings (USD 0.03/day), and further expansion triggers physical constraint violations. We additionally uncover a discount factor coupling (gamma_eff = 0.891) shrinking the effective planning horizon from 8.3 h to 46 min -- a benchmark-wide issue warranting audit. Systematic ablation across planning horizon, reward weights, and observation enrichment confirms all pre-filled-buffer configurations cluster within 0.7% (USD 37.18--USD 37.42), demonstrating that equipment minimum power -- not algorithmic design -- imposes the binding constraint.
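
The percentages quoted in this abstract can be re-derived from its dollar figures with a few lines of Python. The cost values are from the abstract. The 5-minute control step and the nominal discount factor of 0.99 are assumptions introduced here, chosen only because they reproduce the stated 8.3 h and 46 min horizons under the common 1/(1 - gamma) rule of thumb; the abstract does not state them.

```python
# Costs in USD/day, taken from the abstract.
FLOOR = 35.51         # measured energy floor
ELECTRICAL = 35.44    # continuous electrical load within the floor
BASELINE = 37.18      # SAC with pre-filled replay buffer
EMPTY_BUFFER = 35.57  # SAC trained from an empty buffer
WORST_PREFILLED = 37.42

gap = BASELINE - FLOOR
pct_above_floor = 100 * gap / FLOOR
pct_gap_closed = 100 * (BASELINE - EMPTY_BUFFER) / gap
pct_electrical = 100 * ELECTRICAL / FLOOR
pct_cluster_spread = 100 * (WORST_PREFILLED - BASELINE) / BASELINE


def effective_horizon_steps(gamma: float) -> float:
    """Rule-of-thumb planning horizon, in steps, for discount factor gamma."""
    return 1.0 / (1.0 - gamma)


STEP_MINUTES = 5.0    # ASSUMPTION: not stated in the abstract
NOMINAL_GAMMA = 0.99  # ASSUMPTION: not stated in the abstract
GAMMA_EFF = 0.891     # from the abstract

nominal_hours = effective_horizon_steps(NOMINAL_GAMMA) * STEP_MINUTES / 60
effective_minutes = effective_horizon_steps(GAMMA_EFF) * STEP_MINUTES

print(f"baseline above floor: {pct_above_floor:.1f}%")
print(f"gap closed by empty buffer: {pct_gap_closed:.0f}%")
print(f"electrical share of floor: {pct_electrical:.1f}%")
print(f"nominal horizon: {nominal_hours:.1f} h, effective: {effective_minutes:.0f} min")
```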

2606.01660 2026-06-02 cs.LG

Gate the Filter, Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs

门控滤波器而非消息:预传播图神经网络中的节点-通道混合

Zichao Yue, Zhiru Zhang

发表机构 * School of Electrical and Computer Engineering, Cornell University(康奈尔大学电气与计算机工程学院)

AI总结 针对预传播图神经网络中复杂跳聚合器性能不佳的问题,提出FilterMoE模型,通过3D门控张量联合路由节点和通道上的可学习切比雪夫滤波器专家,在11个同质和异质基准测试中平均提升1.53个测试分数。

详情
AI中文摘要

预传播图神经网络(PPGNNs)将所有图相关的计算推入预处理步骤,仅对生成的密集跳特征进行训练,这使得它们具有高度可扩展性。该领域的一个难题是,更复杂的跳聚合器并不总是可靠地优于简单的聚合器:在许多基准测试中,基于普通MLP的聚合器与跳注意力变体相当或更优。我们从图滤波器的角度重新审视这一行为。在预计算的扩散基上,现有的PPGNNs主要区别在于滤波器系数如何在节点和特征通道之间共享,而非仅仅在原始聚合器容量上。基于MLP的架构学习通道相关的滤波器,这些滤波器在节点之间大致共享,而基于跳注意力的架构学习节点相关的混合,这些混合在通道之间大致共享。这揭示了标准PPGNN设计中的一个缺失机制:在预传播计算约束下,联合节点和通道自适应滤波。我们提出FilterMoE,一种混合专家PPGNN,其中一小批可学习的切比雪夫滤波器专家通过3D门控张量在节点和通道上联合路由。在11个同质和异质基准测试中,FilterMoE在9个数据集上优于强PPGNN基线,并在所有三个大规模基准测试中排名第一,平均测试分数提高了1.53分。这些结果确立了联合节点-通道滤波器路由作为数据集特定跳聚合器选择的稳健替代方案。

英文摘要

Pre-propagation graph neural networks (PPGNNs) push all graph-dependent computation into a preprocessing step and train only on the resulting dense hop features, which makes them highly scalable. A puzzle in this regime is that more complex hop aggregators do not reliably outperform simpler ones: on many benchmarks, a plain MLP-based aggregator matches or beats hop-attention variants. We revisit this behavior from a graph-filter perspective. Over a precomputed diffusion basis, existing PPGNNs differ mainly in how filter coefficients are shared across nodes and feature channels, rather than simply in raw aggregator capacity. MLP-based architectures learn channel-dependent filters that are largely shared across nodes, while hop-attention-based architectures learn node-dependent mixtures that are largely shared across channels. This reveals a missing regime in standard PPGNN designs: joint node- and channel-adaptive filtering under the pre-propagation computational contract. We propose FilterMoE, a mixture-of-experts PPGNN in which a small bank of learnable Chebyshev filter experts is routed jointly over nodes and channels by a 3D gating tensor. Across eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets and ranks first on all three large-scale benchmarks, improving the average test score by 1.53 points. These results establish joint node-channel filter routing as a robust alternative to dataset-specific hop-aggregator selection.
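
The idea of routing a small bank of Chebyshev filter experts jointly over nodes and channels can be sketched in plain Python. This is a toy under stated assumptions: the two-node graph, the expert coefficients, and the hand-written gate values are made up, and in the paper the gate would presumably be produced by a learned network rather than fixed by hand.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def chebyshev_basis(lap: Matrix, x: Matrix, order: int) -> List[Matrix]:
    """Precompute [T_0(L)X, ..., T_order(L)X] via T_k = 2 L T_{k-1} - T_{k-2}.

    Assumes order >= 1. This is the graph-dependent preprocessing step.
    """
    basis = [x, matmul(lap, x)]
    for _ in range(2, order + 1):
        lx = matmul(lap, basis[-1])
        basis.append([[2 * a - b for a, b in zip(r1, r2)]
                      for r1, r2 in zip(lx, basis[-2])])
    return basis


def mix_experts(basis: List[Matrix], theta: Matrix, gate) -> Matrix:
    """gate[n][c][e] weights expert e at node n and channel c (a 3D tensor)."""
    n_nodes, n_ch = len(basis[0]), len(basis[0][0])
    out = [[0.0] * n_ch for _ in range(n_nodes)]
    for n in range(n_nodes):
        for c in range(n_ch):
            for e, coeffs in enumerate(theta):
                y = sum(coeffs[k] * basis[k][n][c] for k in range(len(coeffs)))
                out[n][c] += gate[n][c][e] * y
    return out


# Two-node graph: rescaled normalized Laplacian L - I.
lap = [[0.0, -1.0], [-1.0, 0.0]]
x = [[1.0, 2.0], [3.0, 4.0]]               # 2 nodes, 2 channels
theta = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # expert 0: identity, expert 1: T_1
gate = [[[0.0, 1.0], [0.0, 1.0]],           # node 0 uses expert 1 on both channels
        [[1.0, 0.0], [0.5, 0.5]]]           # node 1 differs per channel
print(mix_experts(chebyshev_basis(lap, x, 2), theta, gate))
```

Since the basis is computed once, training touches only the coefficients and the gate, which keeps the sketch consistent with the pre-propagation setting described in the abstract.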

2606.01651 2026-06-02 cs.CV

Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment

通过几何对齐恢复文本到图像蒸馏中的初始噪声敏感性

Huayang Huang, Ruoyu Wang, Jinhui Zhao, Wei Deng, Daiguo Zhou, Jian Luan, Yu Wu, Ye Zhu

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 提出几何感知蒸馏(GAD)框架,通过匹配雅可比-向量积来对齐教师和学生模型的局部功能行为,从而恢复文本到图像蒸馏中丢失的初始噪声敏感性,提升下游噪声驱动控制任务的性能。

Comments ICML 2026

详情
AI中文摘要

生成式蒸馏通过将多步轨迹压缩为少步学生模型,在保持感知质量的同时显著加速文本到图像(T2I)生成。然而,现有方法主要优化效率和输出保真度,往往忽略了原始轨迹的关键属性。在这项工作中,我们识别出一个缺失的关键属性:对初始噪声的敏感性,其退化会损害依赖噪声优化和操作的下游控制方法。我们将此问题追溯到标准的蒸馏目标,这些目标强制逐点输出对齐,无意中压平了输入-输出景观并抑制了教师的局部几何结构。为了解决这个问题,我们提出了几何感知蒸馏(GAD),一种保持敏感性的框架,用于对齐教师和学生模型的局部功能行为。具体而言,GAD匹配关于输入噪声的雅可比-向量积,使学生能够再现教师对扰动的微分响应。在多个T2I范式和噪声驱动控制任务上的大量实验表明,GAD显著恢复了敏感性并提高了多样性,同时保持了高视觉保真度。代码可在 https://github.com/Hannah1102/GAD 获取。

英文摘要

Generative distillation significantly accelerates text-to-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing methods primarily optimize efficiency and output fidelity, often neglecting critical properties of the original trajectory. In this work, we identify a key missing property: sensitivity to initial noise, whose degradation impairs downstream control methods relying on noise-based optimization and manipulation. We trace this issue to standard distillation objectives that enforce pointwise output alignment, inadvertently flattening the input-output landscape and suppressing the teacher's local geometric structure. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that aligns the local functional behavior of teacher and student models. Specifically, GAD matches Jacobian-vector products with respect to input noise, enabling the student to reproduce the teacher's differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly restores sensitivity and improves diversity while maintaining high visual fidelity. Code is available at https://github.com/Hannah1102/GAD.
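
Matching Jacobian-vector products (JVPs) with respect to the input noise can be illustrated with a toy Python sketch. The teacher and student below are made-up two-dimensional functions, and the JVP is approximated with central finite differences purely to keep the example dependency-free; a real implementation would more likely use an automatic-differentiation JVP. The function names are hypothetical and none of this is the paper's code.

```python
from typing import Callable, List

Vector = List[float]


def fd_jvp(f: Callable[[Vector], Vector], z: Vector, v: Vector,
           eps: float = 1e-4) -> Vector:
    """Central-difference approximation of J_f(z) @ v."""
    plus = f([zi + eps * vi for zi, vi in zip(z, v)])
    minus = f([zi - eps * vi for zi, vi in zip(z, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def jvp_matching_loss(teacher, student, z: Vector, v: Vector) -> float:
    """Squared distance between teacher and student responses to perturbing z along v."""
    jt = fd_jvp(teacher, z, v)
    js = fd_jvp(student, z, v)
    return sum((a - b) ** 2 for a, b in zip(jt, js))


def teacher(z: Vector) -> Vector:   # toy stand-in for the multi-step teacher
    return [z[0] ** 2, z[0] * z[1]]


def student(z: Vector) -> Vector:   # toy stand-in for the few-step student
    return [z[0], z[1]]


z0 = [1.0, 2.0]    # initial noise
v0 = [1.0, 0.0]    # perturbation direction
print(jvp_matching_loss(teacher, student, z0, v0))
```

At `z0` the two toy models could be tuned to agree in output value while still responding differently to a perturbation of the noise, which is the kind of mismatch a pointwise output loss would not penalize and a JVP term would.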

2606.01643 2026-06-02 cs.CV

Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument

手语生成中的条件坍塌:诊断与缩放论证

Rui Hong, Jana Košecká

发表机构 * George Mason University(乔治·梅森大学)

AI总结 本文通过提出三个独立评估层级(初始姿态条件、输出多样性、目标忠实度)并利用冻结运动自编码器的潜在表示计算成对距离比,诊断手语生成模型中的条件坍塌问题,并论证句子级配对数据集规模是瓶颈。

详情
AI中文摘要

手语生成(SLP)是从自然语言文本生成虚拟人物手语动作的任务。生成动作的质量通常通过运动空间弗雷歇距离(FID)和反向翻译(BT)BLEU分数在How2Sign等基准上进行评估。这两个指标可能大幅提升,而底层生成器未能忠实表示手语手势。在这项工作中,我们提出在三个独立层级上评估生成的动作:(τ1)初始姿态条件,(τ2)输出多样性,以及(τ3)目标忠实度。我们使用冻结运动自编码器(MoAE)的潜在表示计算这些成对距离比。我们在How2Sign数据集上评估了14个SLP模型检查点,包括重新实现的Neural Sign Actors(NSA),并表明τ3忠实度从未达到,而FID变化近两个数量级且与忠实度不相关。我们表明,在孤立词汇数据集ASL3DWord上可以达到有利的τ3,因此将句子级配对数据集的大小确定为瓶颈。

英文摘要

Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: ($\tau_1$) initial-pose conditioning, ($\tau_2$) output diversity, and ($\tau_3$) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that $\tau_3$ faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show that on the isolated gloss dataset ASL3DWord favorable $\tau_3$ can be attained, hence isolating the size of the sentence-level paired dataset as the bottleneck.

2606.01640 2026-06-02 cs.AI cs.CL

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

MobEvolve:用于可解释人类移动性生成的智能体自进化启发式系统

Junlin He, Yihong Tang, Tong Nie, Ao Qu, Yuebing Liang, Hamzeh Alizadeh, Bang Liu, Wei Ma, Lijun Sun

发表机构 * The Hong Kong Polytechnic University(香港理工大学) McGill University(麦吉尔大学) MIT(麻省理工学院) Tsinghua University(清华大学) Autorité régionale de transport métropolitain(大都会交通地区管理局) Université de Montréal(蒙特利尔大学) Mila – Quebec AI Institute(魁北克人工智能研究所)

AI总结 提出MobEvolve,首个智能体自进化启发式框架,通过LLM代理迭代演化内部逻辑,在保持可解释性和推理效率的同时,在个体轨迹保真度、群体分布对齐和行为合理性上超越现有方法。

详情
AI中文摘要

人类移动性生成旨在根据个体特征为目标人群合成真实的出行链。现有范式,包括深度生成模型、基于LLM的方法和传统启发式方法,难以同时满足该任务的复杂需求,同时保持可解释性、行为合理性、群体级分布对齐和推理效率。为弥合这一差距,我们引入了MobEvolve,这是首个用于人类移动性生成的智能体自进化启发式框架。MobEvolve初始化一个行为启发的启发式系统,并利用LLM代理迭代演化其内部逻辑。通过在验证集上诊断经验性错位和失败案例,代理提出有针对性的更新并积累演化记忆以实现累积性自我改进。在新加坡和蒙特利尔基准上的广泛评估表明,MobEvolve在个体轨迹保真度、群体级分布对齐和行为合理性方面显著优于最先进的深度生成和基于LLM的方法,同时保持可解释性和高推理效率。

英文摘要

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maintaining interpretability, behavioral plausibility, population-level distributional alignment, and inference efficiency. To bridge this gap, we introduce MobEvolve, the first agentic self-evolving heuristic framework for human mobility generation. MobEvolve initializes a behavior-inspired heuristic system and employs an LLM agent to iteratively evolve its internal logic. By diagnosing empirical misalignments and failure cases on a validation set, the agent proposes targeted updates and accumulates evolution memory for cumulative self-improvement. Extensive evaluations on the Singapore and Montreal benchmarks demonstrate that MobEvolve significantly outperforms state-of-the-art deep generative and LLM-based methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility, while preserving interpretability and high inference efficiency.

2606.01638 2026-06-02 cs.CV

CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation

CanonCGT:基于参考的颜色分级通过规范枢轴表示

Jinwon Ko, Keunsoo Ko, Chang-Su Kim

发表机构 * Korea University(高丽大学) The Catholic University of Korea(韩国天主教大学)

AI总结 提出一种基于规范枢轴的两阶段框架CanonCGT,通过去除内在色调偏差并匹配参考风格,实现稳定、真实的颜色分级。

Comments Accepted at CVPR 2026

详情
AI中文摘要

基于参考的颜色分级旨在再现参考图像的色调和光照,同时保持色彩和谐与场景结构。现有的逼真和基于滤镜的方法通常产生不稳定的色调映射(过度偏移或不一致地保留颜色),导致不自然的结果。我们提出CanonCGT,一个基于规范枢轴(一种用于稳定颜色映射的风格中立中间表示)的两阶段框架。第一阶段通过去除内在色调偏差来规范化输入,第二阶段对其进行颜色分级以匹配参考风格。一种双阶段训练方案DP-CGT结合了监督预设学习和非配对照片上的自监督细化。CanonCGT在多种数据集上产生逼真且色调一致的结果,在稳定性和视觉保真度上超越了最先进的方法。我们的代码可在 https://github.com/Jinwon-Ko/CanonCGT 获取。

英文摘要

Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings -- over-shifting or inconsistently retaining colors -- leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot -- a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our code is available at https://github.com/Jinwon-Ko/CanonCGT.

2606.01636 2026-06-02 cs.CV

Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

Pave-GRPO:通过原则性平均速度分解超越瞬时引导

Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) Harbin Institute of Technology(哈尔滨工业大学) Beihang University(北京航空航天大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出Pave-GRPO方法,通过原则性平均速度分解将粗粒度过渡分解为细粒度子轨迹,在不增加生成成本的情况下将奖励反馈传播到更多中间步骤,实现更全面的偏好对齐。

Comments 8 pages, 5 figures

详情
AI中文摘要

通过群体相对策略优化(GRPO)的后训练已成为将基于流的生成模型与人类偏好对齐的强大范式。然而,流模型的迭代去噪性质在生成用于策略梯度更新的群体展开时会产生巨大成本,迫使现有方法使用极少的去噪步骤进行训练。这种时间稀疏性严重限制了偏好优化:奖励反馈只能到达每个轨迹的少数阶段,使得绝大多数中间去噪步骤缺乏直接监督,从而损害了对齐的粒度。为了解决这个问题,我们提出了Pave-GRPO,它通过原则性平均速度分解重新表述了GRPO目标。我们不生成昂贵的高步数展开,而是保持高效的少步数群体采样,但将每个粗粒度转换分解为跨越多个中间时间步的等效细粒度子轨迹集合。这将奖励反馈传播到更密集的时间阶段集,从而实现更全面的偏好对齐,而无需额外的生成成本。这种设计有两个好处:(i)零成本视野扩展:通过直接重用分段群体样本及其相关奖励,Pave-GRPO在固定采样预算下显著拓宽了有效优化范围;(ii)全面的时间监督:通过将瞬时速度目标等效分解为多时间步集合,它将奖励信号分布到去噪过程的更多中间阶段,从而实现更细粒度、更彻底的偏好优化。大量实验验证了Pave-GRPO在不同奖励设置下有效推进了偏好对齐,提供了全面的性能提升。

英文摘要

Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.

2606.01635 2026-06-02 cs.CL cs.AI

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

AlphaToken: 在LLM后训练中解耦适应性与稳定性的路径感知响应令牌估值

Liu Qing, Ou Wu, Yi Du

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院)

AI总结 提出AlphaToken框架,通过解耦适应性(促进目标任务学习)和稳定性(保持预训练能力)并引入路径感知机制,利用Fisher漂移代理和Ghost点积扩展实现高效令牌估值,从而在微调和偏好优化中屏蔽低价值令牌,提升后训练性能并缓解灾难性遗忘。

详情
AI中文摘要

令牌选择对于有效的LLM后训练至关重要。然而,现有方法大多依赖局部启发式,很少将令牌选择形式化为对单个响应令牌的原则性估值。我们引入了AlphaToken,一个响应令牌估值框架,它将估值解耦为适应性(促进目标任务学习)和稳定性(保持预训练能力),并通过结合局部令牌梯度的直接路径信号与自回归生成中的下游因果路径信号,使每个目标具有路径感知性。由于保留数据通常不可用,AlphaToken通过锚定在预训练参考模型上的Fisher漂移代理来近似稳定性。为了高效计算,我们将Ghost点积扩展到令牌级估值。AlphaToken在微调和偏好优化过程中屏蔽低价值响应令牌,将训练信号集中在更有价值的位置。实验表明,AlphaToken提高了后训练性能并缓解了灾难性遗忘。

英文摘要

Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce AlphaToken, a response token valuation framework that decouples valuation into adaptation (promoting target-task learning) and stability (preserving pre-trained capabilities), and makes each objective path-aware by combining the direct-path signal from local token gradients with the downstream causal-path signal in autoregressive generation. Since retention data are typically unavailable, AlphaToken approximates stability via a Fisher-drift proxy anchored at the pre-trained reference model. For efficient computation, we extend Ghost Dot-Product to token-level valuation. AlphaToken masks low-value response tokens during fine-tuning and preference optimization, concentrating training signals on more valuable positions. Experiments show that AlphaToken improves post-training performance and mitigates catastrophic forgetting.
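
The final masking step described in this abstract, dropping low-value response tokens from the loss, can be sketched in a few lines of Python. The token values are taken as given here: how they are computed (adaptation, stability, path-aware signals, Fisher-drift proxy) is the substance of the paper and is not reproduced. The `keep_ratio` parameter and the function name are assumptions made for illustration.

```python
import math
from typing import List


def masked_token_loss(token_losses: List[float],
                      token_values: List[float],
                      keep_ratio: float = 0.5) -> float:
    """Average the loss over only the highest-valued response tokens.

    keep_ratio is a hypothetical knob: the fraction of tokens kept.
    """
    n = len(token_losses)
    k = max(1, math.ceil(keep_ratio * n))
    # Indices of the k tokens with the largest value scores.
    ranked = sorted(range(n), key=lambda i: token_values[i], reverse=True)
    kept = set(ranked[:k])
    return sum(token_losses[i] for i in kept) / k


# Toy response of four tokens with made-up losses and value scores.
losses = [1.0, 2.0, 3.0, 4.0]
values = [0.9, 0.1, 0.5, 0.7]
print(masked_token_loss(losses, values, keep_ratio=0.5))
```

Only the first and last tokens survive the mask in this toy case, so the training signal is concentrated on the two positions with the highest value scores.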

2606.01634 2026-06-02 cs.LG cs.AI

E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation

Lin Jiang, Dahai Yu, Ximiao Li, Guang Wang

Affiliation: Florida State University

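AI summary: Proposes E4GEN, an explainable diffusion framework that achieves controllable event-level generation of extreme events through three components (E-Activator, E-Predictor, and E-Control), outperforming existing methods in overall fidelity, extreme-event fidelity, and downstream utility.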

Comments 48 pages, 26 figures

Abstract

Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasize overall distributional fidelity while failing to faithfully capture extreme events. To advance existing research, we propose E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. E4GEN provides systematic insights into when, what, and how to control extreme-event generation through three key components. First, E-Activator learns the dataset-adaptive extreme-control signal activation step during the denoising process without interfering with regular temporal components, including trend and seasonality. Second, E-Predictor determines what control signal to enforce through Self-Driven Semantic Prediction, where each sample derives its own control signal by inferring latent extreme-event information during generation. It also includes a novel Data-Conditioned Training, Noise-Initiated Sampling mechanism to address the issue of unavailable training labels. Third, E-Control specifies how to control extreme-event generation through a trainable Extreme Control Network, which transforms the semantic control signal into layer-wise signals and injects them into the denoising process. We evaluate E4GEN on six datasets with 17 metrics, and extensive experiments show that E4GEN outperforms state-of-the-art models across multiple dimensions, including overall fidelity, extreme-event fidelity, and downstream utility.

2606.01626 2026-06-02 cs.LG

IMWM: Intuition Models Complement World Models for Latent Planning

Baoqi Gao, Ruize Han, Miao Wang, Song Wang

Affiliations: Beihang University; Shenzhen University of Advanced Technology

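AI summary: To address the search bottleneck in planning with latent world models, proposes the IMWM framework, which uses an intuition model together with three lightweight components to significantly improve success rates on four pixel-based tasks.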

Abstract

Planning with a learned latent world model is a promising route to control from raw pixels, but a strong world model alone is not enough. We show this experimentally: even with a perfect world model (operationalized by replacing the learned forward predictor with an idealized rollout of the true environment dynamics), a finite-budget sample-based planner still fails on some tasks, indicating that the bottleneck can lie in search rather than in world-model accuracy. Motivated by this gap, we propose IMWM (Intuition Model + World Model), which pairs the world model with an intuition model trained from demonstrations to recognize promising actions. The two models collaborate through three lightweight components: (i) Retrieval Initialization, which initializes the planner's action proposal from a retrieved demonstration; (ii) Hybrid Cost, which combines the intuition score with the world-model rollout cost; and (iii) a Reliability Gate, which adjusts how much the planner trusts intuition in each setting. Across four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, and OGBench-Cube), IMWM has higher mean success than the world-model-only planner on all four, with the largest gains on Two-Room (99.2%, +11.5 percentage points) and OGBench-Cube (94.7%, +28.5 percentage points).
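
One way to picture how a hybrid cost and a reliability gate might interact is the sketch below. The abstract does not give the actual formula, so the convex combination, the scalar `gate` in [0, 1], and both function names are assumptions for illustration only; they are not IMWM's definitions.

```python
def hybrid_cost(rollout_cost, intuition_score, gate):
    """Blend a world-model rollout cost with an intuition score.

    rollout_cost:    cost of a candidate action sequence under the world model (lower is better).
    intuition_score: how promising the intuition model finds it (higher is better).
    gate:            hypothetical reliability weight in [0, 1]; 0 ignores intuition entirely.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return (1.0 - gate) * rollout_cost - gate * intuition_score


def select_candidate(rollout_costs, intuition_scores, gate):
    """Return the index of the candidate with the lowest hybrid cost."""
    costs = [hybrid_cost(c, s, gate) for c, s in zip(rollout_costs, intuition_scores)]
    return min(range(len(costs)), key=costs.__getitem__)


rollout_costs = [1.0, 1.2, 3.0]
intuition_scores = [0.1, 0.9, 0.95]
world_model_only = select_candidate(rollout_costs, intuition_scores, gate=0.0)  # index 0
balanced = select_candidate(rollout_costs, intuition_scores, gate=0.5)          # index 1
intuition_only = select_candidate(rollout_costs, intuition_scores, gate=1.0)    # index 2
```

In this toy setup, raising the gate shifts the planner's choice from the candidate the world model prefers toward the one the intuition model prefers, which is the kind of trade-off a per-setting reliability gate would tune.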