arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01666 2026-06-02 cs.LG cs.AI

DOT-MoE: Differentiable Optimal Transport for MoEfication

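Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

Affiliations: University of California, Berkeley

AI summary: Proposes DOT-MoE, a framework that decomposes dense layers into experts via differentiable optimal transport, jointly learning the neuron assignment and the routing policy, and retains 90% of the original performance while cutting active parameters by 50%.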

Comments Accepted at ICML 2026

Abstract

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
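The balanced-transport idea in this abstract can be illustrated with a minimal sketch of Sinkhorn-Knopp normalization. This is not the authors' implementation: the function name `sinkhorn_plan`, the affinity scores, the temperature, and the iteration count are all assumptions chosen for illustration, and the Straight-Through Estimator step that the abstract mentions is left out.

```python
import math


def sinkhorn_plan(scores, n_iters=100, temperature=0.5):
    """Balanced Sinkhorn-Knopp normalization on a small affinity matrix.

    scores[i][j] is a hypothetical affinity of neuron i for expert j.
    The returned soft plan has rows summing to 1 (each neuron is fully
    assigned) and columns summing to n_neurons / n_experts (equal capacity).
    """
    n_neurons, n_experts = len(scores), len(scores[0])
    capacity = n_neurons / n_experts
    # Gibbs kernel; subtracting the global max keeps the exponentials stable.
    top = max(max(row) for row in scores)
    plan = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: every neuron distributes exactly one unit of mass.
        for i in range(n_neurons):
            row_sum = sum(plan[i])
            plan[i] = [p / row_sum for p in plan[i]]
        # Column step: every expert receives exactly `capacity` units.
        for j in range(n_experts):
            col_sum = sum(plan[i][j] for i in range(n_neurons))
            for i in range(n_neurons):
                plan[i][j] *= capacity / col_sum
    return plan


# Hypothetical affinities: 4 neurons, 2 experts; three neurons prefer expert 0,
# yet the capacity constraint forces each expert to hold two neurons' worth of mass.
scores = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.2, 0.6]]
plan = sinkhorn_plan(scores)
```

Every operation here is a division or multiplication, so the plan stays differentiable with respect to the scores. Taking a per-row argmax would not by itself preserve the capacity constraint, which is presumably where the discrete assignment step described in the abstract comes in.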

2606.01665 2026-06-02 cs.LG

Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim

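Bo Li, Chen Zhang

Affiliations: College of Smart Energy, Shanghai Jiao Tong University

AI summary: Directly measures the energy floor of SAC-based HVAC control through minimum-action experiments and finds that replay buffer initialization is the main source of sub-optimality; removing it brings the cost close to the floor.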

Comments 5 pages, 3 figures, 2 tables. Presented at AI-DEEDS 2026 Workshop, ACM Sustainability Week, Banff, Canada (non-archival)

Abstract

We quantify the energy floor -- the minimum achievable cost given action space constraints -- for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator. Through minimum-action experiments, we directly measure this floor at USD 35.51/day, dominated by continuous electrical loads (USD 35.44, 99.8%) with negligible gas consumption. The standard SAC baseline, initialized with schedule-policy replay buffer transitions, converges to USD 37.18/day, 4.7% above the floor. We identify buffer initialization as the dominant source of sub-optimality in this scenario: training from an empty buffer reduces cost to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible additional savings (USD 0.03/day), and further expansion triggers physical constraint violations. We additionally uncover a discount factor coupling (gamma_eff = 0.891) shrinking the effective planning horizon from 8.3 h to 46 min -- a benchmark-wide issue warranting audit. Systematic ablation across planning horizon, reward weights, and observation enrichment confirms all pre-filled-buffer configurations cluster within 0.7% (USD 37.18--USD 37.42), demonstrating that equipment minimum power -- not algorithmic design -- imposes the binding constraint.
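The percentages quoted in this abstract can be recomputed from its dollar figures, as in the short check below. The horizon calculation is a sketch under two assumptions that the abstract does not state: that the effective horizon is taken as 1 / (1 - gamma), and that the nominal discount is 0.99 with a 5-minute control step. Those values were picked only because they reproduce the reported 8.3 h and 46 min.

```python
# Figures quoted in the abstract (USD per day).
floor = 35.51            # measured energy floor
electric = 35.44         # continuous electrical loads
baseline = 37.18         # SAC with a pre-filled replay buffer
empty_buffer = 35.57     # SAC trained from an empty buffer
worst_prefilled = 37.42  # upper end of the pre-filled-buffer cluster

electric_share = electric / floor                           # about 0.998
gap_above_floor = (baseline - floor) / floor                # about 0.047
gap_closed = (baseline - empty_buffer) / (baseline - floor) # about 0.96
prefilled_spread = (worst_prefilled - baseline) / baseline  # below 0.007


def effective_horizon_steps(gamma):
    """Common rule of thumb: a discount gamma looks about 1 / (1 - gamma) steps ahead."""
    return 1.0 / (1.0 - gamma)


STEP_MINUTES = 5.0     # ASSUMPTION: control step length, not given in the abstract
NOMINAL_GAMMA = 0.99   # ASSUMPTION: nominal discount, not given in the abstract
GAMMA_EFF = 0.891      # effective discount reported in the abstract

nominal_hours = effective_horizon_steps(NOMINAL_GAMMA) * STEP_MINUTES / 60.0
coupled_minutes = effective_horizon_steps(GAMMA_EFF) * STEP_MINUTES
```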

2606.01660 2026-06-02 cs.LG

Gate the Filter, Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs

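Zichao Yue, Zhiru Zhang

Affiliations: School of Electrical and Computer Engineering, Cornell University

AI summary: To address the underperformance of complex hop aggregators in pre-propagation GNNs, proposes FilterMoE, which jointly routes learnable Chebyshev filter experts over nodes and channels with a 3D gating tensor, improving the average test score by 1.53 points across 11 homophilic and heterophilic benchmarks.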

Abstract

Pre-propagation graph neural networks (PPGNNs) push all graph-dependent computation into a preprocessing step and train only on the resulting dense hop features, which makes them highly scalable. A puzzle in this regime is that more complex hop aggregators do not reliably outperform simpler ones: on many benchmarks, a plain MLP-based aggregator matches or beats hop-attention variants. We revisit this behavior from a graph-filter perspective. Over a precomputed diffusion basis, existing PPGNNs differ mainly in how filter coefficients are shared across nodes and feature channels, rather than simply in raw aggregator capacity. MLP-based architectures learn channel-dependent filters that are largely shared across nodes, while hop-attention-based architectures learn node-dependent mixtures that are largely shared across channels. This reveals a missing regime in standard PPGNN designs: joint node- and channel-adaptive filtering under the pre-propagation computational contract. We propose FilterMoE, a mixture-of-experts PPGNN in which a small bank of learnable Chebyshev filter experts is routed jointly over nodes and channels by a 3D gating tensor. Across eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets and ranks first on all three large-scale benchmarks, improving the average test score by 1.53 points. These results establish joint node-channel filter routing as a robust alternative to dataset-specific hop-aggregator selection.
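One possible reading of "Chebyshev filter experts routed by a 3D gating tensor" is sketched below. It is an illustration only: the function names, the way the gate is indexed (node, channel, expert), and the use of a fixed rather than learned gate are assumptions, not details from the abstract. The Chebyshev basis is computed once, matching the pre-propagation setting, and each expert is just a coefficient vector over that basis.

```python
def matmul(a, x):
    """Dense (n x n) @ (n x c) product on nested lists."""
    return [[sum(a[i][j] * x[j][c] for j in range(len(x)))
             for c in range(len(x[0]))] for i in range(len(a))]


def chebyshev_basis(l_scaled, x, order):
    """Precompute T_0(L)x, ..., T_order(L)x with T_k = 2 L T_{k-1} - T_{k-2}.

    l_scaled is the graph Laplacian rescaled so its spectrum lies in [-1, 1].
    """
    basis = [x]
    if order >= 1:
        basis.append(matmul(l_scaled, x))
    for _ in range(2, order + 1):
        lp = matmul(l_scaled, basis[-1])
        prev2 = basis[-2]
        basis.append([[2.0 * lp[i][c] - prev2[i][c] for c in range(len(x[0]))]
                      for i in range(len(x))])
    return basis


def filter_moe(basis, expert_coeffs, gate):
    """Mix expert filters per node and per channel.

    expert_coeffs[e][k] is expert e's coefficient on T_k;
    gate[i][c][e] is the (hypothetical) weight of expert e at node i, channel c.
    """
    n, channels = len(basis[0]), len(basis[0][0])
    out = [[0.0] * channels for _ in range(n)]
    for i in range(n):
        for c in range(channels):
            for e, coeffs in enumerate(expert_coeffs):
                filtered = sum(t * basis[k][i][c] for k, t in enumerate(coeffs))
                out[i][c] += gate[i][c][e] * filtered
    return out
```

With a gate that is constant over nodes this reduces to a channel-dependent filter, and with a gate that is constant over channels it reduces to a node-dependent mixture, which are the two regimes the abstract contrasts.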

2606.01651 2026-06-02 cs.CV

Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment

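Huayang Huang, Ruoyu Wang, Jinhui Zhao, Wei Deng, Daiguo Zhou, Jian Luan, Yu Wu, Ye Zhu

Affiliations: Huazhong University of Science and Technology

AI summary: Proposes Geometry-Aware Distillation (GAD), which aligns the local functional behavior of teacher and student models by matching Jacobian-vector products, restoring the initial-noise sensitivity lost in text-to-image distillation and improving downstream noise-driven control tasks.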

Comments ICML 2026

Abstract

Generative distillation significantly accelerates text-to-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing methods primarily optimize efficiency and output fidelity, often neglecting critical properties of the original trajectory. In this work, we identify a key missing property: sensitivity to initial noise, whose degradation impairs downstream control methods relying on noise-based optimization and manipulation. We trace this issue to standard distillation objectives that enforce pointwise output alignment, inadvertently flattening the input-output landscape and suppressing the teacher's local geometric structure. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that aligns the local functional behavior of teacher and student models. Specifically, GAD matches Jacobian-vector products with respect to input noise, enabling the student to reproduce the teacher's differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly restores sensitivity and improves diversity while maintaining high visual fidelity. Code is available at https://github.com/Hannah1102/GAD.
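To make the idea of matching Jacobian-vector products concrete, here is a toy sketch. It swaps in a central finite-difference estimate of the JVP rather than automatic differentiation, and the teacher and student functions are hypothetical two-dimensional stand-ins, not image generators. The actual GAD objective is not specified in the abstract, so the squared-error form below is an assumption.

```python
import math


def jvp_fd(f, x, v, eps=1e-5):
    """Central finite-difference estimate of the Jacobian-vector product J_f(x) v."""
    plus = f([xi + eps * vi for xi, vi in zip(x, v)])
    minus = f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def jvp_matching_loss(teacher, student, noise, direction):
    """Mean squared gap between teacher and student responses to a noise perturbation."""
    jt = jvp_fd(teacher, noise, direction)
    js = jvp_fd(student, noise, direction)
    return sum((a - b) ** 2 for a, b in zip(jt, js)) / len(jt)


# Hypothetical stand-ins for generators mapping noise to an output.
def teacher(z):
    return [math.sin(z[0]) + z[1], z[0] * z[1]]


def flat_student(z):
    # Ignores its input entirely: zero sensitivity to the noise.
    return [0.5, 0.0]


def shifted_student(z):
    # Different outputs from the teacher, but the same local geometry.
    return [math.sin(z[0]) + z[1] + 3.0, z[0] * z[1] - 1.0]
```

The shifted student disagrees with the teacher pointwise yet incurs almost no JVP loss, while the flat student is penalized for ignoring the noise. In this toy setting that is the distinction between output alignment and sensitivity alignment.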

2606.01643 2026-06-02 cs.CV

Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument

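Rui Hong, Jana Košecká

Affiliations: George Mason University

AI summary: Diagnoses conditional collapse in sign language production models by evaluating at three independent levels (initial-pose conditioning, output diversity, target faithfulness), computed as pairwise-distance ratios over the latent representations of a frozen motion autoencoder, and argues that the size of sentence-level paired datasets is the bottleneck.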

Abstract

Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: ($\tau_1$) initial-pose conditioning, ($\tau_2$) output diversity, and ($\tau_3$) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that $\tau_3$ faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show that on the isolated gloss dataset ASL3DWord favorable $\tau_3$ can be attained, hence isolating the size of the sentence-level paired-dataset as the bottleneck.

2606.01640 2026-06-02 cs.AI cs.CL

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

Junlin He, Yihong Tang, Tong Nie, Ao Qu, Yuebing Liang, Hamzeh Alizadeh, Bang Liu, Wei Ma, Lijun Sun

Affiliations: The Hong Kong Polytechnic University; McGill University; MIT; Tsinghua University; Autorité régionale de transport métropolitain; Université de Montréal; Mila – Quebec AI Institute

AI summary: Proposes MobEvolve, the first agentic self-evolving heuristic framework, in which an LLM agent iteratively evolves the system's internal logic; it surpasses existing methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility while preserving interpretability and inference efficiency.

Abstract

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maintaining interpretability, behavioral plausibility, population-level distributional alignment, and inference efficiency. To bridge this gap, we introduce MobEvolve, the first agentic self-evolving heuristic framework for human mobility generation. MobEvolve initializes a behavior-inspired heuristic system and employs an LLM agent to iteratively evolve its internal logic. By diagnosing empirical misalignments and failure cases on a validation set, the agent proposes targeted updates and accumulates evolution memory for cumulative self-improvement. Extensive evaluations on the Singapore and Montreal benchmarks demonstrate that MobEvolve significantly outperforms state-of-the-art deep generative and LLM-based methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility, while preserving interpretability and high inference efficiency.

2606.01638 2026-06-02 cs.CV

CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation

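Jinwon Ko, Keunsoo Ko, Chang-Su Kim

Affiliations: Korea University; The Catholic University of Korea

AI summary: Proposes CanonCGT, a two-stage framework built on a canonical pivot, which removes intrinsic tonal bias and then matches the reference style to achieve stable, realistic color grading.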

Comments CVPR 2026 accepted

Abstract

Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings -- over-shifting or inconsistently retaining colors -- leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot -- a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our code is available at https://github.com/Jinwon-Ko/CanonCGT.

2606.01636 2026-06-02 cs.CV

Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang

Affiliations: University of Science and Technology of China; Shanghai Jiao Tong University; Fudan University; Harbin Institute of Technology; Beihang University; Shanghai AI Laboratory

AI summary: Proposes Pave-GRPO, which uses principled average velocity decomposition to break coarse transitions into fine-grained sub-trajectories, propagating reward feedback to more intermediate steps without extra generation cost for more comprehensive preference alignment.

Comments 8 pages, 5 figures

Abstract

Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.
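The abstract does not give the decomposition itself. One way to see why a coarse step can be split without changing it is the additivity of the time integral, sketched below. This assumes the average velocity over an interval is the time average of the instantaneous velocity along the trajectory; the symbols $u$, $v$, $z$ and the intermediate time $r$ are notation introduced here, not taken from the abstract, and the paper's actual formulation may differ.

```latex
u(z_t, t, s) \;=\; \frac{1}{s - t} \int_t^s v(z_\tau, \tau)\, d\tau,
\qquad
(s - t)\, u(z_t, t, s) \;=\; (r - t)\, u(z_t, t, r) \;+\; (s - r)\, u(z_r, r, s),
\quad t < r < s .
```

Under this reading, the displacement of one coarse transition from $t$ to $s$ equals the sum of the displacements of its sub-intervals, so a reward attached to the coarse step can be associated with the intermediate time $r$ without sampling any extra rollout.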

2606.01635 2026-06-02 cs.CL cs.AI

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

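Liu Qing, Ou Wu, Yi Du

Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences

AI summary: Proposes AlphaToken, a framework that decouples adaptation (promoting target-task learning) from stability (preserving pre-trained capabilities) and adds a path-aware mechanism, using a Fisher-drift proxy and an extension of Ghost Dot-Product for efficient token valuation; low-value tokens are masked during fine-tuning and preference optimization, improving post-training performance and mitigating catastrophic forgetting.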

Abstract

Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce AlphaToken, a response token valuation framework that decouples valuation into adaptation (promoting target-task learning) and stability (preserving pre-trained capabilities), and makes each objective path-aware by combining the direct-path signal from local token gradients with the downstream causal-path signal in autoregressive generation. Since retention data are typically unavailable, AlphaToken approximates stability via a Fisher-drift proxy anchored at the pre-trained reference model. For efficient computation, we extend Ghost Dot-Product to token-level valuation. AlphaToken masks low-value response tokens during fine-tuning and preference optimization, concentrating training signals on more valuable positions. Experiments show that AlphaToken improves post-training performance and mitigates catastrophic forgetting.

2606.01626 2026-06-02 cs.LG

IMWM: Intuition Models Complement World Models for Latent Planning

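Baoqi Gao, Ruize Han, Miao Wang, Song Wang

Affiliations: Beihang University; Shenzhen University of Advanced Technology

AI summary: To address the search bottleneck in planning with latent world models, proposes IMWM, in which an intuition model collaborates with the world model through three lightweight components, significantly raising success rates on four pixel-based tasks.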

Abstract

Planning with a learned latent world model is a promising route to control from raw pixels, but a strong world model alone is not enough. We show this experimentally: even with a perfect world model (operationalized by replacing the learned forward predictor with an idealized rollout of the true environment dynamics), a finite-budget sample-based planner still fails on some tasks, indicating that the bottleneck can lie in search rather than in world-model accuracy. Motivated by this gap, we propose IMWM (Intuition Model + World Model), which pairs the world model with an intuition model trained from demonstrations to recognize promising actions. The two models collaborate through three lightweight components: (i) Retrieval Initialization, which initializes the planner's action proposal from a retrieved demonstration; (ii) Hybrid Cost, which combines the intuition score with the world-model rollout cost; and (iii) a Reliability Gate, which adjusts how much the planner trusts intuition in each setting. Across four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, and OGBench-Cube), IMWM has higher mean success than the world-model-only planner on all four, with the largest gains on Two-Room (99.2%, +11.5 percentage points) and OGBench-Cube (94.7%, +28.5 percentage points).

2606.01620 2026-06-02 cs.CV

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

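Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo

Affiliations: Microsoft Research; Microsoft AI

AI summary: Proposes a streamable talking portrait video generation framework that combines a causal video VAE with an autoregressive latent denoising model and uses reference-image guidance to achieve real-time, high-quality generation.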

Comments CVPR 2026 (Highlight) Camera ready

Abstract

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

2606.01617 2026-06-02 cs.CL cs.AI

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

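Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang

Affiliations: Oregon State University; University of Wisconsin–Madison

AI summary: Proposes EvoPool, an evolutionary multi-agent framework that iteratively evolves programmatic annotators and aggregates their votes, substantially improving supervision in specialized domains at low labeling cost.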

Comments 39 pages, 7 figures. Code: https://github.com/tianyi0216/EvoPool

Abstract

Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool
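The step from a pool of programmatic annotators to soft training labels can be pictured with a deliberately simple stand-in. The sketch below uses plain normalized vote counts, not EvoAgg (which the abstract describes as text-aware and feature-based), and the three keyword annotators and class names are hypothetical examples rather than anything produced by EvoPool.

```python
from collections import Counter


def soft_label(text, annotators, classes):
    """Turn annotator votes into a soft label; abstentions (None) are skipped."""
    votes = Counter()
    for annotate in annotators:
        label = annotate(text)
        if label is not None:
            votes[label] += 1
    total = sum(votes.values())
    if total == 0:
        # Nobody voted: fall back to a uniform distribution over classes.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: votes[c] / total for c in classes}


# Hypothetical programmatic annotators: cheap, executable, allowed to abstain.
def mentions_inhibit(text):
    return "inhibitor" if "inhibit" in text.lower() else None


def mentions_activate(text):
    return "activator" if "activat" in text.lower() else None


def mentions_block_or_inhibit(text):
    lowered = text.lower()
    return "inhibitor" if "block" in lowered or "inhibit" in lowered else None


POOL = [mentions_inhibit, mentions_activate, mentions_block_or_inhibit]
CLASSES = ["inhibitor", "activator"]
```

Annotators of this kind run at string-matching cost once written, which gives a rough sense of why the abstract can report near-zero per-example cost for a pool of executable annotators.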

2606.01615 2026-06-02 cs.CV cs.MM

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua

Affiliations: Nanyang Technological University; National University of Singapore

AI summary: Proposes RDMF, a multimodal fusion framework based on reaction-diffusion processes that emulates biological pattern formation to dynamically align video and text for video moment retrieval and highlight detection.

Comments Published in ACM MM 2025. Address some typos

Abstract

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.
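For readers unfamiliar with the Gray-Scott model named in this abstract, the sketch below advances the classic two-species system by one explicit Euler step on a 1D periodic grid. It shows the standard equations only. How RDMF maps video and text features onto the two species is not described in the abstract, and the diffusion, feed, and kill values are commonly used illustrative settings, not parameters from the paper.

```python
def gray_scott_step(u, v, du=0.16, dv=0.08, feed=0.035, kill=0.060, dt=1.0):
    """One explicit Euler step of the 1D Gray-Scott system with periodic boundaries.

    du/dt = Du * lap(u) - u*v^2 + feed * (1 - u)
    dv/dt = Dv * lap(v) + u*v^2 - (feed + kill) * v
    """
    n = len(u)
    new_u, new_v = [0.0] * n, [0.0] * n
    for i in range(n):
        # Discrete Laplacian; index -1 wraps around, and so does (i + 1) % n.
        lap_u = u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]
        lap_v = v[i - 1] - 2.0 * v[i] + v[(i + 1) % n]
        reaction = u[i] * v[i] * v[i]
        new_u[i] = u[i] + dt * (du * lap_u - reaction + feed * (1.0 - u[i]))
        new_v[i] = v[i] + dt * (dv * lap_v + reaction - (feed + kill) * v[i])
    return new_u, new_v


# Uniform background with a small seeded perturbation in the middle of the grid.
u = [1.0] * 11
v = [0.0] * 11
u[5], v[5] = 0.5, 0.5
for _ in range(10):
    u, v = gray_scott_step(u, v)
```

The diffusion terms spread each species to its neighbors, loosely analogous to the abstract's description of features diffusing across time, while the `u * v * v` term is the non-linear reaction that amplifies `v` where both species are present.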

2606.01612 2026-06-02 cs.CV cs.LG

Self-Improving Small Object Grounding in LVLMs

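Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun

Affiliations: University of Georgia

AI summary: Exploits the internal attention patterns of LVLMs, using either a lightweight IoU regressor or a training-free attention-entropy selector to pick the best box from multiple candidates, enabling self-improving small-object grounding.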

Comments 29 Pages, 15 Figures

Abstract

Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, showing that useful attention structure improves both localization reliability and interpretability in LVLMs.
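A training-free selector of the kind described for ACS-Free might look roughly like the sketch below. The abstract does not say in which direction entropy is ranked, so treating lower entropy (more focused attention) as better is an assumption here, as are the function names and the flat list-of-weights representation of an attention map. Selecting and averaging over the discriminative heads is omitted.

```python
import math


def attention_entropy(weights):
    """Shannon entropy of an attention map after normalizing it to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def rank_candidates(candidates):
    """Sort (box, attention_weights) pairs, most focused attention first.

    ASSUMPTION: lower entropy indicates a more reliable box.
    """
    return sorted(candidates, key=lambda cand: attention_entropy(cand[1]))


# Hypothetical candidate boxes (x1, y1, x2, y2) with toy attention maps.
candidates = [
    ((10, 10, 40, 40), [0.25, 0.25, 0.25, 0.25]),  # diffuse attention
    ((12, 11, 30, 29), [0.97, 0.01, 0.01, 0.01]),  # sharply focused attention
]
best_box = rank_candidates(candidates)[0][0]
```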

2606.01610 2026-06-02 cs.AI

Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

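Haoben Huang, Shuxin Liu, Ou Wu, Di Gao

Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences

AI summary: To address the ripple effects triggered by single edits in large language models, proposes a Joint Neighborhood Optimization framework that jointly handles the coupled editable-side and preserved-side pressures through Pressure-Aware Coordination and a semantic pre-execution gate, improving propagation and preservation metrics on RippleEdits by at least 7.0%.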

Abstract

Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to related facts and unintended perturbation of preserved ones. Existing methods address these two effects separately, without explicitly modeling their coupling. We challenge this separation through an analysis of ripple responses across typical baselines, identifying two coupled design pressures: editable-side coordination and preserved-side leakage. We propose Joint Neighborhood Optimization (JNO), a new knowledge-editing framework to formalize and jointly address both pressures at the target-planning stage. JNO instantiates this principle through Pressure-Aware Coordination (PAC), which jointly optimizes neighborhood target representations under coupled constraints, and a semantic pre-execution gate that rejects high-risk target plans before parameter execution. Experiments on RippleEdits show JNO improves propagation and preservation metrics by at least 7.0% while preserving cross-backbone editing stability.

2606.01608 2026-06-02 cs.CV

Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression

利用语义和像素表示进行超低比特率图像压缩

Hao Wei, Yanhui Zhou, Chenyang Ge, Saeed Anwar, Ajmal Mian

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(人机混合增强智能国家重点实验室,人工智能与机器人研究院,西安交通大学) School of Information and Telecommunication, Xi’an Jiaotong University(信息与电信学院,西安交通大学) Department of Computer Science and Software Engineering, The University of Western Australia(计算机科学与软件工程系,西澳大学)

AI总结 提出SPRDiff扩散压缩方法,通过三重编码器架构和失真感知重建模块,在超低比特率下同时保持语义一致性和像素级保真度,实现率-失真-感知权衡最优。

详情
AI中文摘要

大多数现有的极端压缩方法未能实现最优的率-失真-感知权衡,因为它们通常优先考虑感知保真度和视觉真实性而非像素级精度。因此,重建结果往往与原始图像有明显偏差。超低比特率图像压缩因此至关重要——不仅要产生极其紧凑的表示,还要确保重建图像在语义上与源图像保持一致,并在像素级忠实于源图像。为此,我们提出了SPRDiff,一种基于扩散的压缩方法,充分利用语义和像素表示,从而在超低比特率约束下增强重建保真度。具体来说,我们开发了一个三重编码器架构,利用预训练的面向失真和面向语义编码器的高保真特征来补偿冻结的VAE编码器提取的有限表示,从而改善潜在压缩和熵建模。为了进一步提高扩散模型的重建保真度,我们引入了一个具有双特征提取的失真感知重建模块。该模块不仅生成保留主要结构的粗略重建,还提供实用且准确的语义级和像素级条件信号来指导扩散模型。在基准数据集上的大量实验表明,我们的方法在极低比特率(低于0.03 bpp)下在率-失真-感知权衡方面优于最先进的方法,有效保持了重建图像中的感知质量和像素级保真度。我们将在https://github.com/cshw2021/SPRDiff发布源代码和训练模型。

英文摘要

Most existing extreme compression methods fail to achieve an optimal rate-distortion-perception trade-off, as they typically prioritize perceptual fidelity and visual realism over pixel-level accuracy. Consequently, the resulting reconstructions often deviate noticeably from the originals. Ultra-low bitrate image compression is therefore crucial, not only for producing extremely compact representations but also for ensuring that reconstructed images remain semantically coherent and faithful to the source at the pixel level. To this end, we propose SPRDiff, a diffusion-based compression method that fully leverages both semantic and pixel representations, thereby enhancing reconstruction fidelity under ultra-low bitrate constraints. Specifically, we develop a triple-encoder architecture that utilizes high-fidelity features from the pretrained distortion-oriented and semantic-oriented encoders to compensate for the limited representations extracted by the frozen VAE encoder, thereby improving latent compression and entropy modeling. To further enhance the reconstruction fidelity of diffusion models, we introduce a distortion-aware reconstruction module with dual feature extraction. This module not only generates a coarse reconstruction that preserves the main structures, but also provides practical and accurate semantic- and pixel-level conditional signals to guide the diffusion model. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the rate-distortion-perception trade-off at extremely low bitrates (below 0.03 bpp), effectively preserving both perceptual quality and pixel-wise fidelity in the reconstructed images. We will release the source code and trained models at https://github.com/cshw2021/SPRDiff.

2606.01604 2026-06-02 cs.CV

Paving the Way for Point Cloud Video Representation Learning Using A PDE Model

使用PDE模型为点云视频表示学习铺平道路

Zhuoxu Huang, Zhenkun Fan, Jungong Han, Josef Kittler

发表机构 * Department of Computer Science, Aberystwyth University(阿伯里斯特威斯大学计算机科学系) Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University(自动化系、北京信息科学与技术国家研究中心、清华大学) Department of Electrical Engineering, Surrey University(萨里大学电子工程系)

AI总结 提出MotionPDE方法,通过将时空相关性学习建模为可解的偏微分方程(PDE),并利用对比学习结构优化,作为即插即用模块提升点云视频表示学习性能。

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) in 2026

详情
AI中文摘要

研究时空相关性,特别是空间点随时间的变化,对于理解点云视频至关重要。传统方法,尤其是基于流的技术,由于顺序点云数据的无序空间排列,难以处理这些相关性。为了解决这一挑战,我们提出了一种新方法,通过将问题建模为可解的偏微分方程(PDE)来正则化时空相关性学习。虽然PDE在物理领域长期有效,但其在点云视频等新型序列数据上的应用仍未充分探索。受流体分析启发,我们构建了一个简化的PDE,并通过时间嵌入和空间嵌入之间的对比学习结构来指导和优化PDE的求解过程。借助这种额外的监督,我们的方法MotionPDE作为现有骨干模型的有效、即插即用的增强模块,仅增加极少的计算开销和参数。利用对比学习过程,我们进一步挖掘了MotionPDE的自监督能力,取得了有希望的结果,突显了其在点云视频数据解释中的实用性和适应性。带有训练检查点的代码仓库将在https://github.com/zhh6425/motionpde.git提供,以促进未来研究。

英文摘要

Investigating spatial-temporal correlations, specifically how spatial points vary over time, is crucial for understanding point cloud videos. Traditional methods, particularly flow-based techniques, struggle with these correlations due to the unordered spatial arrangement of sequential point cloud data. To address this challenge, we propose a novel approach that regularizes spatial-temporal correlation learning by formulating the problem as a solvable Partial Differential Equation (PDE). While PDEs have long been effective in the physical domain, their application to novel sequential data like point cloud video remains underexplored. Inspired by fluid analysis, we construct a simplified PDE, and the process of solving PDE is guided and refined by a contrastive learning structure between the temporal embeddings and the spatial embeddings. With this extra supervision, our method, named MotionPDE, serves as an effective, plug-and-play enhancement module for existing backbone models, adding minimal computational overhead and parameters. Capitalizing on the contrastive learning process, we delve deeper into the self-supervised capabilities of MotionPDE, yielding promising results that underscore its utility and adaptability in point cloud video data interpretation. The code repo with trained checkpoints will be available at https://github.com/zhh6425/motionpde.git for facilitating future research.

2606.01601 2026-06-02 cs.CV

EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

EIVE: 面向检测Transformer的端到端实例特定视觉解释

Jianlin Xiang, Yanshan Li, Linhui Dai

发表机构 * Institute of Intelligent Information Processing, Shenzhen University(智能信息处理研究院,深圳大学) Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University(广东省智能信息处理重点实验室,深圳大学) Shenzhen Key Laboratory of Modern Communications and Information Processing, Shenzhen University(深圳现代通信与信息处理重点实验室,深圳大学)

AI总结 提出EIVE框架,通过重新公式化解码器交叉注意力为实例级特征归因路径,直接生成实例级显著性图,无需梯度计算或输入扰动,高效解释DETR类检测器。

Comments 17 pages, 11 figures

详情
AI中文摘要

由于目标检测的多实例特性,其视觉可解释性仍然具有挑战性。现有方法主要采用事后范式(如基于梯度或扰动的解释方法)来解释预训练检测器。然而,这些方法需要额外的梯度计算或重复模型推理,导致效率有限。为解决此问题,我们提出了一种端到端实例特定视觉解释框架(EIVE),该框架在Detection Transformer(DETR)类模型的前向传播后直接生成实例级显著性图。具体而言,我们将解码器中的交叉注意力机制重新公式化为实例级特征归因路径,使得每个目标查询的交叉注意力对应于其预测实例的视觉归因。基于此公式,我们设计了一个跨层混合共识融合(CLHCF)模块,聚合解码器各层的交叉注意力信号,生成稳定且紧凑的解释。EIVE的解释过程既不需要梯度计算也不需要输入扰动,具有高计算效率,并适用于单尺度和多尺度的DETR类目标检测器。最后,我们提出了一种注意力感知联合训练策略(AAJTS)作为面向训练的应用,该策略对交叉注意力模式施加空间约束,以鼓励稳定且集中的归因表示,从而提高可解释性和检测性能。在MS COCO 2017、ExDark和Cityscapes上的实验表明,EIVE生成高质量的实例级显著性图,在标准指标上达到与最先进事后方法相当或更好的性能,同时显著提高了解释效率。代码可在https://github.com/xjlDestiny/EIVE.git获取。

英文摘要

Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.

2606.01600 2026-06-02 cs.CV cs.CL cs.RO

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation

RoboTrustBench:机器人操作视频世界模型的可信度基准测试

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

发表机构 * Singapore Management University(新加坡管理大学) Fudan University(复旦大学) Princeton University(普林斯顿大学)

AI总结 针对视频世界模型在机器人操作中的可信度问题,提出RoboTrustBench基准,包含正常、约束敏感、反事实和对抗四种场景,通过专家验证的指令-图像对和六维评估协议,发现当前模型在约束推理、反事实基础、物理交互和不安全指令抑制方面存在不足。

Comments Project: https://huiqiongli.github.io/RoboTrustBench/

详情
AI中文摘要

视频世界模型越来越多地用于机器人操作,然而现有基准大多在有效、可行和安全的指令下评估它们。我们引入了RoboTrustBench,一个用于评估视频世界模型在四种场景下可信度的基准:正常、约束敏感、反事实和对抗。基于真实世界的DROID片段构建,RoboTrustBench包含1,207个专家验证的指令-图像对和一个六维评估协议,包含13个细粒度标准。通过人类和MLLM评估七个代表性的视频世界模型,我们发现当前模型通常生成视觉上连贯的视频,但在约束推理、反事实基础、物理交互和不安全指令抑制方面存在困难。这些结果表明,视觉质量和表面级别的指令遵循不足以实现可信赖的机器人视频世界建模。

英文摘要

Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.

2606.01599 2026-06-02 cs.AI

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

TRON:面向视觉推理强化学习的目标化规则可验证在线环境

Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出TRON在线环境框架,通过可控生成-验证程序产生无限训练实例,支持视觉推理强化学习,在多个多模态基准上提升性能。

Comments 27 pages, 8 figures

详情
AI中文摘要

视觉推理的强化学习(RL)需要可扩展、可验证且可控的训练信号。现有的视觉RL后训练在静态策划数据集上进行,其图像-问题-答案样本受限于收集预算。本文引入TRON(目标化、规则可验证的在线环境),一种在线环境基底:训练rollout由可控的生成-验证程序按需生成,该程序采样新的潜在视觉状态,渲染图像,提出问题,并精确验证答案。因此,单次运行可以按当前课程所需的难度级别抽取无限的新实例流。当前TRON套件包含520个环境,组织成五个能力桶(空间、数学、图表、模式/逻辑和计数);同一基底支持在所有桶上训练的单个完整模型以及每个桶的能力专家模型,无需额外数据收集。我们还引入了基底分析,涵盖生成可靠性、实例和级别多样性、跨环境近似重复以及按难度级别的基础模型通过率。使用TRON进行RL后训练在Qwen3-VL-4B、Qwen2.5-VL-7B和MiMo-VL-7B-SFT上的十个外部多模态推理基准上持续提升性能。

英文摘要

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with TRON consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

2606.01595 2026-06-02 cs.LG

Uncertainty-Calibrated Diffusion for Reliable 3D Molecular Graph Generation

不确定性校准的扩散用于可靠的3D分子图生成

Fang Wan, Jingxiang Qu, Yi Liu

发表机构 * State University of New York at Stony Brook(纽约州立大学石溪分校)

AI总结 针对扩散模型在3D分子图生成中因认知不确定性导致采样质量下降的问题,提出不确定性校准扩散方法(UCD),通过校准反向扩散过程来补偿认知不确定性,在多个基准上取得最优性能。

详情
AI中文摘要

贝叶斯推理通过将预测视为分布而非确定性值,为神经网络中的认知不确定性建模提供了原则性框架。同时,用于3D分子图生成的扩散模型在受严格化学约束的脆弱几何结构上运行,使得推理对不确定性误校准高度敏感。一个被广泛忽视的问题是,来自学习去噪器的认知不确定性会与反向扩散过程中有意注入的偶然不确定性相互作用,导致系统性的方差膨胀以及真实分布与模拟分布之间的不匹配。这种效应对于高精度分子生成尤其有害,因为即使微小偏差也可能违反化学有效性。在这项工作中,我们对认知不确定性如何通过扩散推理传播并降低采样质量进行了理论和实证分析。基于此研究,我们提出了UCD(不确定性校准扩散),一种简单而有效的方法,通过校准反向扩散过程来考虑认知不确定性。在标准3D分子基准上的大量实验表明,UCD在不同基线方法中一致地提高了采样质量,为3D分子扩散建立了新的最先进性能。代码可在 https://github.com/jiuguaiwf/UCD 获取。

英文摘要

Bayesian inference provides a principled framework for modeling epistemic uncertainty in neural networks by treating predictions as distributions rather than deterministic values. Meanwhile, diffusion-based models for 3D molecular graph generation operate on fragile geometric structures governed by strict chemical constraints, making inference highly sensitive to uncertainty miscalibration. A largely overlooked issue is that epistemic uncertainty arising from the learned denoiser interacts with the aleatoric uncertainty intentionally injected during reverse diffusion, leading to systematic variance inflation and a mismatch between the true distribution and the simulated distribution. This effect is particularly detrimental for high-precision molecular generation, where even small deviations can violate chemical validity. In this work, we provide a theoretical and empirical analysis of how epistemic uncertainty propagates through diffusion inference and degrades sampling quality. Building on this investigation, we propose UCD (Uncertainty-Calibrated Diffusion), a simple yet effective method that calibrates the reverse diffusion process to account for epistemic uncertainty. Extensive experiments on standard 3D molecular benchmarks demonstrate that UCD consistently improves sampling quality across diverse baseline methods, establishing new state-of-the-art performance for 3D molecular diffusion. The code is available at https://github.com/jiuguaiwf/UCD.

2606.01591 2026-06-02 cs.CV cs.LG

TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

TLG: 通过源标注重建和类别目标推理实现视频问答的时间逻辑基础

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出TLG三阶段系统,通过重建动作时间线、解析问题为时间逻辑程序并确定性执行,结合强视觉语言模型和前沿推理模型,将视频问答准确率从46.9%提升至71.37%。

详情
AI中文摘要

TimeLogic挑战评估对视频的形式时间逻辑推理,涵盖16个算子(之前、之后、直到、自从、总是、共现、排序等),采用布尔和四选一形式。端到端视频语言模型在此任务上接近随机水平,因为它们将视频视为帧的集合,无法定位动作发生的时间。我们提出TLG(时间逻辑基础),一个三层系统:(i)从生成基准测试的公共源数据集标注中重建每个视频的动作时间线,将每个问题解析为时间逻辑程序,并确定性执行;(ii)在没有标注的情况下回退到强大的开放视觉语言模型;(iii)仅将视觉语言模型经验上最弱的问题类别路由到前沿推理模型。TLG将测试准确率从46.9%的视觉语言模型基线提升到71.37%,绝对增益+24.5,距排行榜榜首不到3分。我们报告了广泛的消融实验,包括三种基于模型的时间线重建变体,它们都低于整体视觉语言模型,将时间基础隔离为不可约的瓶颈,并表明驱动准确率的是真实标注,而非更大的模型。

英文摘要

The TimeLogic Challenge evaluates formal temporal-logic reasoning over video, covering 16 operators (before, after, until, since, always, co-occur, ordering, ...) in Boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, a +24.5 absolute gain, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations, not larger models, drive accuracy.
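The first tier described above (an action timeline plus a deterministically executed temporal-logic program) can be sketched in a few lines. This is an illustrative sketch only: the `Interval` type, the tuple-based program format, and the exact semantics chosen for `before`, `after`, and `co_occur` are assumptions, not the operator definitions used by TimeLogic or TLG.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """One annotated action occurrence (hypothetical timeline format)."""
    action: str
    start: float
    end: float


def spans(timeline, action):
    """All (start, end) pairs for a given action label."""
    return [(iv.start, iv.end) for iv in timeline if iv.action == action]


def before(timeline, a, b):
    # ASSUMED semantics: some occurrence of `a` ends no later than
    # some occurrence of `b` starts.
    return any(end_a <= start_b
               for _, end_a in spans(timeline, a)
               for start_b, _ in spans(timeline, b))


def after(timeline, a, b):
    # `a` after `b` is treated as `b` before `a`.
    return before(timeline, b, a)


def co_occur(timeline, a, b):
    # ASSUMED semantics: some occurrences of `a` and `b` overlap in time.
    return any(start_a < end_b and start_b < end_a
               for start_a, end_a in spans(timeline, a)
               for start_b, end_b in spans(timeline, b))


OPERATORS = {"before": before, "after": after, "co_occur": co_occur}


def execute(program, timeline):
    """Run a parsed question, given as (operator, action_a, action_b)."""
    operator, a, b = program
    return OPERATORS[operator](timeline, a, b)


if __name__ == "__main__":
    timeline = [
        Interval("open door", 0.0, 2.0),
        Interval("hold cup", 2.5, 6.0),
        Interval("pour water", 3.0, 5.0),
    ]
    print(execute(("before", "open door", "pour water"), timeline))
    print(execute(("co_occur", "pour water", "hold cup"), timeline))
```

Because the answer is computed from interval endpoints rather than predicted, its correctness depends entirely on the timeline, which is consistent with the abstract's finding that annotation quality drives accuracy.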

2606.01590 2026-06-02 cs.CV cs.GR

Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis

面向街景新视角合成的有效多传感器条件控制

Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai, Jonathan Tremblay, Iro Armeni, Ehsan Adeli, Lior Yariv, Gordon Wetzstein

发表机构 * Stanford University(斯坦福大学) NVIDIA

AI总结 提出StreetNVS视频扩散框架,通过参考增强相机注意力模块和相对射线级位置编码联合利用LiDAR、环视图像和相机位姿,实现稀疏LiDAR条件下的高质量街景新视角合成。

详情
AI中文摘要

现代车辆平台配备了丰富的传感器套件,包括LiDAR、标定多相机系统和精确的自车运动,这原则上为从新视角重新渲染驾驶场景提供了强信号。最近一系列工作利用视频扩散模型完成此任务,通过其生成先验从稀疏车辆观测中合成合理的新视角。然而在实践中,现有方法仅利用了该信号的一部分,且其质量往往随着目标轨迹偏离记录驾驶路径而下降。我们认为这本质上是一个多传感器融合问题:稀疏LiDAR重投影提供准确但不完整的度量几何,环视参考图像提供密集外观但不提供度量深度,而相机位姿将两者跨视图连接起来。我们引入StreetNVS,一种视频扩散框架,通过基于相对射线级位置编码的参考增强相机注意力模块,联合对所有三种信号进行条件控制。我们开发了一种两阶段课程训练策略,逐步使模型适应越来越稀疏的LiDAR。在Waymo Open数据集上,StreetNVS在稀疏LiDAR条件下显著优于最先进的基线,与依赖密集10-100倍点云的方法性能相当。我们进一步展示了沿极端轨迹外路径(如高程、车道偏移、拉回和旋转)合成连贯视频的能力。我们的网站:https://streetnvs.github.io

英文摘要

Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: https://streetnvs.github.io

2606.01584 2026-06-02 cs.CL cs.AI

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

识别LLM中高置信度的社会偏见以构建可信的对话辅导代理

Aitor Arronte Alvarez, Naiyi Xie Fincham

发表机构 * University of Hawaii at Manoa(夏威夷大学马诺亚分校)

AI总结 本研究通过生成对话数据集,评估大型语言模型在辅导场景中检测社会偏见的能力,发现模型在对话上下文中比基准测试更难检测偏见,且对错误判断过度自信,影响推理和反馈。

Comments Accepted for AIED 2026

详情
AI中文摘要

对话辅导代理已被证明能提高学习参与度和学生成绩,大型语言模型(LLM)越来越多地被用于这些系统以提供可扩展的个性化反馈。然而,LLM可能会延续或放大刻板的社会偏见,在教育环境中带来特殊风险。在本研究中,我们评估了LLM在对话辅导场景中的表现,以识别高置信度的社会偏见,即模型在无法识别辅导对话中的偏见判断时仍保持高度自信,可能影响其推理和向学习者提供的反馈。我们提出了一种新的数据集生成方法,通过重新生成学生-AI辅导教师互动并引入来自基准数据集的受控偏见轮次,实现在自然教学条件下的偏见评估。利用这些数据,我们评估了多个LLM检测刻板偏见的能力,并通过计算和人工评估分析了其响应背后的置信度和推理。我们发现,在对话辅导上下文中,偏见检测比基于基准的评估更具挑战性,且最先进的LLM对其刻板偏见陈述的错误评估过于自信。此外,模型置信度强烈影响推理和反馈,突显了基于LLM的辅导代理中过度自信和偏见行为的风险。最后,我们讨论了影响、缓解考虑和未来研究方向。

英文摘要

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs' ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research.

2606.01577 2026-06-02 cs.CV

FLAME: Physics-Guided Neural Operators for Onboard Satellite Methane Detection in Hyperspectral Imagery

FLAME:物理引导的神经算子用于高光谱图像中星载甲烷检测

Junhyuk Heo, Junhwan Park, Sancheol Sim, Beomkyu Choi, Woojin Cho

发表机构 * KAIST(韩国科学技术院)

AI总结 提出FLAME,一种将甲烷吸收物理直接嵌入架构的物理引导神经算子,在星载甲烷检测中实现最高精度,像素级假阳性率降低近3倍,参数最少且满足星载硬件延迟预算。

详情
AI中文摘要

甲烷是近期气候变化的主要驱动因素,快速识别其排放源是一项关键的气候干预措施。星载高光谱成像是完成此任务的主要工具,但每个传感器产生的数据量使得地面检测不切实际,因此需要星载检测。经典方法在星载硬件上产生过高的计算成本,而深度学习模型速度快但检测质量不足。我们提出FLAME,一种物理引导的神经算子,将甲烷吸收的物理直接构建到其架构中。在甲烷检测基准上,FLAME在所有评估方法中实现了最高的检测精度,将像素级假阳性率相比最强神经基线降低了近3倍,在学习基线中使用参数最少,并且在星载卫星硬件的延迟预算内运行。

英文摘要

Methane is a major driver of near-term climate change, and rapidly identifying its emission sources is a critical climate intervention. Spaceborne hyperspectral imagery is the primary tool for this task, but the volume of data produced by each sensor makes ground-based detection impractical and necessitates onboard detection. Classical methods incur prohibitive computational cost on onboard hardware, while deep learning models are fast but fall short on detection quality. We propose FLAME, a physics-guided neural operator that builds the physics of methane absorption directly into its architecture. On the methane detection benchmark, FLAME achieves the highest detection accuracy among all evaluated methods, reduces the pixel-level false positive rate by nearly $3\times$ over the strongest neural baseline, uses the fewest parameters among learned baselines, and runs within the latency budget of onboard satellite hardware.

2606.01576 2026-06-02 cs.CV

Deformable Wiener Filter for Future Video Coding

可变形维纳滤波器用于未来视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

发表机构 * National Engineering Research Center of Visual Technology, School of Computer Science, Peking University(视觉技术国家工程研究中心,北京大学计算机科学学院) Core Media Technology, Disney Streaming(核心媒体技术,迪士尼流媒体) Wangxuan Institute of Computer Technology, Peking University(王选计算机研究所,北京大学) Information Technology R&D Innovation Center of Peking University(北京大学信息技术研发创新中心) Peng Cheng Laboratory, Shenzhen(鹏城实验室,深圳)

AI总结 提出一种结合局部与非局部特征的可变形维纳滤波器(DWF),通过监督训练和自适应融合实现高效环路滤波,在VVC标准上平均节省1.16%~2.67%的码率。

Comments This paper has been published in IEEE Transactions on Image Processing

详情
Journal ref
IEEE Transactions on Image Processing, vol. 31, pp. 7222-7236, 2022
AI中文摘要

环路滤波器由于在混合视频编码框架中显著的降噪能力而受到越来越多的关注。然而,现有通用视频编码(VVC)中的环路滤波器主要利用图像局部相似性。尽管一些基于非局部的环路滤波器可以弥补这一不足,但非局部滤波器广泛使用的无监督参数估计方法限制了性能。鉴于此,我们提出了一种可变形维纳滤波器(DWF)。它结合了局部和非局部特性,并基于维纳滤波器理论监督地训练滤波器系数。在滤波过程中,首先为每个感兴趣样本导出局部相邻样本和非局部相似样本。然后,基于块级噪声和样本级特征将待滤波样本分类到特定组中。每组样本共享相同的滤波器系数。之后,根据分类结果自适应融合局部和非局部参考样本。最后,对每个待滤波样本进行带有异常值数据约束的滤波操作。此外,详细分析了所提出的DWF在不同参考样本导出方案下的性能。仿真结果表明,与VTM-11.0相比,所提方法在全内、随机访问和低延迟B配置下平均分别节省1.16%、1.92%和2.67%的码率。

英文摘要

In-loop filters have attracted increasing attention due to their remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly exploit local image similarity. Although some non-local in-loop filters can make up for this shortcoming, the unsupervised parameter estimation widely used by non-local filters limits their performance. In view of this, we propose a deformable Wiener Filter (DWF). It combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on patch-level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed in detail under different reference sample derivation schemes. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to VTM-11.0 for the All Intra, Random Access, and Low-Delay B configurations, respectively.

2606.01566 2026-06-02 cs.LG

RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning

RobustModelMaker: 将Bootstrap稳定性选择与防泄漏嵌套交叉验证相结合的科学机器学习

Amanda S Barnard

发表机构 * School of Computing, Australian National University(计算学院,澳大利亚国立大学)

AI总结 针对小到中等规模科学数据集,提出RobustModelMaker框架,通过结合bootstrap稳定性选择与严格嵌套交叉验证,在防止数据泄漏的同时提供稳定性测试的特征子集和性能估计,在预测得分和选择稳定性上优于多种替代方法。

Comments 19 pages, 2 figure plates, 8 tables

详情
AI中文摘要

小到中等规模的科学数据集使机器学习流程面临两种叠加压力。单次特征选择产生的特征集在训练数据微小扰动下会发生显著变化,而任何使用相同数据进行选择、调参和评估的程序都会产生乐观偏差的性能估计。这两种失效模式通常被视为可分离的,但在科学数据所处的场景中,它们相互影响:不稳定的选择会放大本已乐观的得分的方差,而针对其中一种的标准补救措施很少能解决另一种。RobustModelMaker是一个Python框架,它将bootstrap稳定性选择与严格的嵌套交叉验证相结合,在每个折叠内执行所有预处理和选择,并生成一个经过稳定性测试的特征子集以及一个防泄漏的性能估计。该框架支持二分类、多分类和回归中的九种算法。行为通过确定性测试套件进行验证,该套件涵盖单元测试、性能测试和可重复性检查,在三个真实科学数据集上,与三种替代选择器(ANOVA F检验、带交叉验证的递归特征消除和Boruta)在预测得分和选择稳定性的Jaccard度量上进行比较。RobustModelMaker在每个数据集上的得分与最佳替代选择器相当,并且在所有三种任务类型中,在联合得分-稳定性前沿上占据了一个任何替代方法都无法匹敌的位置。两个示例应用——来自PLCO试验的卵巢癌生物标志物发现和UCI超导数据上的临界温度回归——说明了该框架在实际中的使用方式,以及当稳定性被视为首要交付成果而非涌现属性时,哪些权衡变得可见。

英文摘要

Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where scientific data live, they interact: an unstable selection inflates the variance of an already-optimistic score, and standard remedies for one rarely address the other. RobustModelMaker is a Python framework that couples bootstrap stability selection with strict nested cross-validation, performs all preprocessing and selection inside each fold, and produces a stability-tested feature subset together with a leakage-safe performance estimate. The framework supports nine algorithms across binary classification, multiclass classification, and regression. Behaviour is verified by a deterministic test suite spanning unit, performance, and reproducibility checks on three real scientific datasets comparing to three alternative selectors (ANOVA F-test, recursive feature elimination with cross-validation, and Boruta) on both predictive score and a Jaccard measure of selection stability. RobustModelMaker is competitive in score with the best alternative selector on each dataset, and occupies a position on the joint score-stability frontier that none of the alternatives match across all three task types. Two example applications, ovarian cancer biomarker discovery from the PLCO Trial and critical-temperature regression on the UCI Superconductivity Data, illustrate how the framework is used in practice and what trade-offs become visible when stability is treated as a first-class deliverable rather than an emergent property.
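The two stability notions mentioned above, a Jaccard measure of selection stability and a stability-tested feature subset, can be illustrated with a small sketch. It is not RobustModelMaker's API: the function names, the use of mean pairwise Jaccard similarity, and the 0.6 selection-frequency threshold are all assumptions chosen for illustration. Each input set stands for the features selected on one bootstrap resample.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections):
    """Average Jaccard similarity over all pairs of selected feature sets.

    ASSUMPTION: mean pairwise similarity is one common way to summarize
    selection stability; the paper's exact measure may differ.
    """
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def stable_features(selections, threshold=0.6):
    """Features chosen in at least `threshold` of the resamples.

    The 0.6 default is a hypothetical value, not one taken from the paper.
    """
    counts = Counter(f for s in selections for f in set(s))
    n = len(selections)
    return sorted(f for f, c in counts.items() if c / n >= threshold)


if __name__ == "__main__":
    # Feature sets selected on three hypothetical bootstrap resamples.
    runs = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
    print(round(mean_pairwise_jaccard(runs), 4))
    print(stable_features(runs))
```

To stay leakage-safe in the sense described in the abstract, the resampling and selection would have to run on the training portion of each outer fold only, never on the held-out portion used for the performance estimate.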

2606.01565 2026-06-02 cs.RO cs.CV

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

层级语义增强导航:面向视觉语言导航的最优传输与图驱动推理

Xiang Fang, Wanlong Fang, Changshuo Wang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore(新加坡南洋理工大学交叉学科研究生项目) University College London(伦敦大学学院)

AI总结 提出层级语义增强导航框架,通过动态层级语义场景图、基于最优传输的拓扑规划器与图感知强化学习策略,解决连续环境中的视觉语言导航难题,实现最优性能。

Comments Published in NeurIPS 2025, address some typos

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)对自主智能体构成严峻挑战,要求无缝整合自然语言指令与视觉观察以在复杂3D室内空间导航。现有方法在长程任务中常因场景理解有限、规划效率低下及缺乏稳健决策框架而表现不佳。我们引入层级语义增强导航(HSAN)框架,这是一种开创性方法,通过三项协同创新重新定义VLN-CE。首先,HSAN构建动态层级语义场景图,利用视觉语言模型捕捉从物体到区域再到分区的多级环境表示,实现细粒度空间推理。其次,它采用基于最优传输的拓扑规划器,以Kantorovich对偶为基础,通过平衡语义相关性与空间可达性来选择长期目标,并具有理论最优性保证。第三,图感知强化学习策略确保精确的低层控制,在稳健避障的同时导航子目标。通过整合谱图理论、最优传输和先进的多模态学习,HSAN解决了先前工作中静态地图和启发式规划器的缺陷。在多个具有挑战性的VLN-CE数据集上的大量实验表明,HSAN实现了最先进的性能,在导航成功率和泛化到未见环境方面均有显著提升。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.

2606.01563 2026-06-02 cs.LG

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

MomentKV:消除长上下文推理中KV缓存驱逐的方向差距

Yu Li, Binxu Li, Tian Lan

发表机构 * George Washington University(乔治·华盛顿大学) Princeton University(普林斯顿大学)

AI总结 针对长上下文推理中KV缓存驱逐导致输出退化的问题,提出MomentKV方法,通过维护驱逐令牌集的矩统计量(计数、键均值、值均值和值-键协方差)来识别与累积摘要对齐的令牌,并在推理时提供驱逐注意力输出的一阶近似,实现选择性驱逐与精确校正的相互增强。

详情
AI中文摘要

基于Transformer的语言模型中的自回归解码依赖于KV缓存,其内存占用随序列长度线性增长,成为长上下文推理的主要瓶颈。KV缓存驱逐通过保留固定大小的键值对子集并丢弃其余部分来解决这一问题。我们发现输出退化的一个主要来源并非驱逐令牌上的残余注意力质量(现有方法已最小化),而是保留令牌集与驱逐令牌集之间的方向不匹配。具体而言,实际中被驱逐的令牌通常与保留的令牌接近正交。因此,即使少量的驱逐质量也可能对最终的方向分布产生过大影响,并放大为显著的输出误差。这揭示了现有策略的根本局限性。为解决此问题,我们提出MomentKV,它在驱逐令牌集上维护紧凑的小规模矩统计量,包括计数、键均值、值均值和值-键协方差。在驱逐过程中,利用矩统计量识别已经与累积摘要良好对齐并被其捕获的令牌,保持驱逐集的几何规则性。在推理过程中,它们产生驱逐注意力输出的闭式一阶近似,在选择性驱逐与精确校正之间形成相互增强的循环。在LongBench和RULER上使用LLaMA-3.1-8B-Instruct和Qwen3-4B-Instruct进行的实验表明,MomentKV在每个缓存预算下均优于所有基线,在激进压缩下增益最大。

English abstract

Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We identify that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens in practice are often near-orthogonal to the retained ones. Thus, even a small evicted mass can have an oversized impact on the resulting direction distribution and amplify into substantial output error. This reveals a fundamental limit in existing strategies. To address this, we propose MomentKV, which maintains compact moment statistics over the evicted token set, including a count, key mean, value mean, and value-key covariance. During eviction, the moment statistics are used to identify tokens that are already well aligned with and captured by the accumulated summary, keeping the evicted set geometrically regular. During inference, they yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.
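
The abstract names four statistics (count, key mean, value mean, value-key covariance) and a first-order approximation of the evicted attention output, but gives no formulas. The sketch below is one plausible reading, not the paper's implementation: it linearizes each evicted score around the key mean, exp(s_i) ≈ exp(s̄)(1 + (k_i − k̄)·q/√d), which makes the evicted softmax numerator depend only on those four statistics. The names `EvictedMoments`, `attend`, and `demo` are hypothetical, and the eviction-time selection rule is not modeled.

```python
import numpy as np


class EvictedMoments:
    """Running sums over evicted (key, value) pairs (hypothetical helper)."""

    def __init__(self, d_k, d_v):
        self.count = 0
        self.key_sum = np.zeros(d_k)
        self.val_sum = np.zeros(d_v)
        self.vk_sum = np.zeros((d_v, d_k))  # running sum of outer(v, k)

    def add(self, k, v):
        self.count += 1
        self.key_sum += k
        self.val_sum += v
        self.vk_sum += np.outer(v, k)

    @property
    def key_mean(self):
        return self.key_sum / self.count

    @property
    def val_mean(self):
        return self.val_sum / self.count

    @property
    def vk_cov(self):
        # (1/n) * sum_i (v_i - v_mean)(k_i - k_mean)^T
        return self.vk_sum / self.count - np.outer(self.val_mean, self.key_mean)


def attend(q, K, V, moments=None):
    """Softmax attention over retained (K, V); if `moments` is given, add a
    first-order estimate of the evicted tokens' numerator and denominator."""
    scale = 1.0 / np.sqrt(q.shape[0])
    scores = (K @ q) * scale
    use_moments = moments is not None and moments.count > 0
    ev_score = (moments.key_mean @ q) * scale if use_moments else None
    # Shift by the largest score for numerical stability.
    shift = max(scores.max(), ev_score) if use_moments else scores.max()
    w = np.exp(scores - shift)
    num, den = w @ V, w.sum()
    if use_moments:
        mass = moments.count * np.exp(ev_score - shift)
        num = num + mass * (moments.val_mean + (moments.vk_cov @ q) * scale)
        den = den + mass
    return num / den


def demo(seed=0):
    """Compare plain eviction with moment-corrected eviction on toy data
    where the evicted keys are tightly clustered (an assumption)."""
    rng = np.random.default_rng(seed)
    d = 8
    q = rng.normal(size=d)
    K_keep, V_keep = rng.normal(size=(16, d)), rng.normal(size=(16, d))
    K_ev = rng.normal(size=d) + 0.01 * rng.normal(size=(32, d))
    V_ev = rng.normal(size=(32, d))
    moments = EvictedMoments(d, d)
    for k, v in zip(K_ev, V_ev):
        moments.add(k, v)
    full = attend(q, np.vstack([K_keep, K_ev]), np.vstack([V_keep, V_ev]))
    dropped = attend(q, K_keep, V_keep)
    corrected = attend(q, K_keep, V_keep, moments)
    return np.linalg.norm(full - dropped), np.linalg.norm(full - corrected)


err_dropped, err_corrected = demo()
print(f"error without correction: {err_dropped:.4f}")
print(f"error with moment correction: {err_corrected:.6f}")
```

The linearization is only accurate when evicted keys sit close to their mean, which is consistent with the abstract's emphasis on keeping the evicted set "geometrically regular". For widely scattered keys, this sketch would need higher-order terms or several summaries.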

2606.01560 2026-06-02 cs.LG cs.AI

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

GJDNet: Robust Graph Neural Networks Against Adversarial Attacks via Joint Disentangled Learning

Canyixing Cui, Tao Wu, Xingping Xian, Xiao-Ke Xu, Mao Wang, Weina Niu

Affiliations * School of Computer Science and Technology, Chongqing University of Posts and Telecommunications; School of Cyber Security and Information Law, Chongqing University of Posts and Telecommunications; Computational Communication Research Center, Beijing Normal University; School of Journalism and Communication, Beijing Normal University; School of Computer Science and Engineering, University of Electronic Science and Technology of China

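AI summary Proposes the GJDNet framework, which jointly disentangles node representations and decision spaces and adopts a spherical decision boundary to improve the robustness of graph neural networks across different levels of graph assortativity.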

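AI Chinese abstract (translated)

Graph neural networks (GNNs) are vulnerable to adversarial attacks, which fundamentally invert connectivity patterns by introducing disassortative edges into assortative graphs and assortative edges into disassortative graphs. This structural inversion creates structure-feature mismatches that disrupt neighborhood aggregation across graph types. However, we find that existing defenses are limited: they either treat neighborhoods as monolithic under fixed assortativity assumptions or rely on standard softmax classifiers that cannot cope with perturbation-induced representation shifts. To further exploit this observation, we adopt a robustness perspective that jointly disentangles node representations and decision spaces, isolating perturbation effects while enforcing well-separated decision regions. Based on this principle, we propose the Graph Joint Disentanglement Network (GJDNet), a unified framework for robust node classification across different graph assortativity regimes. GJDNet enhances robustness at both the representation and decision levels: it uses feature-driven soft structural disentanglement combined with skewness-aware neighbor filtering to suppress perturbation-induced structure-feature mismatches, and it introduces a Spherical Decision Boundary (SDB) that promotes intra-class compactness and inter-class separation in the embedding space, thereby stabilizing decision boundaries under perturbation. Theoretical analysis reveals the effectiveness of the proposed disentangled representation and decision mechanisms, and extensive experiments show that GJDNet consistently exhibits strong robustness on graphs with different connectivity patterns.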

English abstract

Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassortative edges in assortative graphs and assortative edges in disassortative graphs. This structural inversion creates structure-feature mismatches that disrupt neighborhood aggregation across different graph types. However, we find that existing defenses are limited, as they either treat neighborhoods as monolithic under fixed assortativity assumptions or rely on standard softmax classifiers that fail to account for perturbation-induced representation shifts. To further exploit this observation, we adopt a robustness perspective that jointly disentangles node representations and decision spaces, isolating perturbation effects while enforcing well-separated decision regions. Based on this principle, we propose Graph Joint Disentanglement Network (GJDNet), a unified framework for robust node classification across diverse graph assortativity regimes. GJDNet enhances robustness at both representation and decision levels: it employs feature-driven soft structural disentanglement with skewness-aware neighbor filtering to suppress perturbation-induced structure-feature mismatches, and introduces a Spherical Decision Boundary (SDB) to promote intra-class compactness and inter-class separation in the embedding space, thereby stabilizing decision boundaries under perturbations. Theoretical analysis provides insights into the effectiveness of the proposed disentangled representation and decision mechanisms, while extensive experiments demonstrate that GJDNet consistently achieves strong robustness across graphs with different connectivity regimes.
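
The structural inversion described in the abstract can be made concrete with a toy measurement. The sketch below is an illustration only, not part of GJDNet: it uses the edge homophily ratio (the fraction of edges joining same-label nodes) as a simple proxy for assortativity, and the graphs, labels, and inserted edges are made up. The function name `edge_homophily` is hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Six nodes, two classes (toy data).
labels = [0, 0, 0, 1, 1, 1]

# Assortative graph: every edge stays inside a class.
assortative = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
# Attack: insert cross-class (disassortative) edges.
cross_class_edges = [(0, 3), (1, 4), (2, 5)]

# Disassortative graph: every edge crosses classes.
disassortative = [(0, 3), (0, 4), (1, 4), (1, 5), (2, 5), (2, 3)]
# Attack: insert same-class (assortative) edges.
same_class_edges = [(0, 1), (3, 4)]

h_assort_clean = edge_homophily(assortative, labels)
h_assort_attacked = edge_homophily(assortative + cross_class_edges, labels)
h_disassort_clean = edge_homophily(disassortative, labels)
h_disassort_attacked = edge_homophily(disassortative + same_class_edges, labels)

print(h_assort_clean, h_assort_attacked)        # ratio falls after the attack
print(h_disassort_clean, h_disassort_attacked)  # ratio rises after the attack
```

In both cases the inserted edges push the ratio toward the opposite regime, which is why a defense that assumes one fixed assortativity level can be misled.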