arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.09639 2026-06-10 cs.LG stat.ML (version update)

Blind denoising diffusion models and the blessings of dimensionality

Zahra Kadkhodaie, Aram-Alexandre Pooladian, Sinho Chewi, Eero Simoncelli

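Affiliations: Flatiron Institute, Simons Foundation; Foundations of Data Science, Yale University; Department of Statistics and Data Science, Yale University; Center for Neural Science and Courant Institute, New York University

AI summary: Proposes blind denoising diffusion models (BDDMs), which simplify the design by not passing the noise amplitude to the neural network, proves their correctness under the assumption that the data's intrinsic dimension is low relative to the ambient dimension, and shows experimentally the benefits of an adaptive scheme.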

Comments 39 pages, 13 figures; Accepted to ICML 2025 FoGen workshop

Abstract

Denoising diffusion models (DDMs) are state-of-the-art methods for learning densities from data across numerous domains, yet many aspects of the training and sampling pipeline remain poorly understood. In particular, noise conditioning requires practitioners to incorporate contrived unprincipled noise embeddings into neural network architectures and to use ad hoc noise schedules for sampling. To address these drawbacks, we provide a complete theory for blind denoising diffusion models (BDDMs): a variant of DDMs where the noise amplitude is not passed into the neural network during training or sampling, obviating the need for the aforementioned design choices. We justify the correctness of BDDMs as a sampling algorithm under an assumption of low intrinsic dimensionality of the underlying data distribution relative to the ambient dimension. This assumption arises through the introduction of the Bayesian problem of estimating noise levels from a single noisy sample, which might be of independent interest. We empirically compare the performance of BDDMs to standard DDMs, showcasing the benefits of an adaptive scheme which is rigorously justified by our analysis.
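The role of low intrinsic dimension in estimating a noise level from a single noisy sample can be illustrated with a small numerical sketch. This is not the paper's estimator; it is a norm-based heuristic, and the linear-subspace data model, the function name `estimate_noise_level`, and all constants are assumptions made for illustration. When the clean signal occupies only k of d ambient dimensions with k much smaller than d, the squared norm of the noisy sample is dominated by the noise, so the norm divided by the square root of d approximates the noise level.

```python
import numpy as np

def estimate_noise_level(y: np.ndarray) -> float:
    """Estimate sigma from one noisy sample y = x + sigma * z, z ~ N(0, I_d).

    Assumes the clean signal x carries little energy compared with the noise
    summed over all d ambient coordinates, so ||y||^2 is close to sigma^2 * d.
    """
    d = y.size
    return float(np.linalg.norm(y) / np.sqrt(d))

rng = np.random.default_rng(0)
d, k, sigma = 10_000, 5, 0.5   # ambient dim, intrinsic dim, true noise level (illustrative)

# Hypothetical data model: clean samples live on a random k-dimensional subspace.
basis = np.linalg.qr(rng.standard_normal((d, k)))[0]   # d x k orthonormal basis
x = basis @ rng.standard_normal(k)                      # clean sample
y = x + sigma * rng.standard_normal(d)                  # single noisy observation

sigma_hat = estimate_noise_level(y)
print(f"true sigma = {sigma}, estimated sigma = {sigma_hat:.4f}")
```

The estimate degrades as k approaches d or as the signal energy grows relative to the total noise energy, which is one way to see why an assumption relating intrinsic and ambient dimension is natural here.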

2602.06411 2026-06-10 cs.LG (version update)

DAH-Net: A Dual-Attention Hybrid Network for Interpretable and Robust EEG-Based Emotion Recognition

S M Rakib UI Karim, Diponkor Bala, Wenyi Lu, Rownak Ara Rasul, Sean Goggins

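Affiliations: Department of Electrical & Computer Engineering, University of Missouri, Columbia, Missouri, USA; Department of Computer Science Engineering, City University, Savar, Dhaka-1340, Bangladesh; Department of Computer Science, University of Missouri, Columbia, Missouri, USA

AI summary: Proposes DAH-Net, a dual-attention hybrid network integrating 1D-CNN, BiLSTM, and dual multi-head attention, which reaches 99.19% test accuracy on 2,479 samples with 988 features, outperforming conventional models; feature analysis indicates that covariance features contribute the most.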

Abstract

EEG-based emotion recognition supports affective brain-computer interfaces and mental health monitoring yet remains challenged by signal complexity, subject variability, and limited interpretability. We propose DAH-Net, a dual-attention hybrid network integrating 1D-CNN, BiLSTM, and dual multi-head attention (16+8 heads) for three-class EEG emotion classification. Evaluated on 2,479 samples with 988 EEG features, DAH-Net achieves 99.19% held-out test accuracy with a 0.81% train-test gap, outperforming RF (96.17%), SVM (96.77%), MLP (97.18%), and Transformer (98.19%) baselines. Friedman testing (χ² = 28.54, p < 0.001) and post-hoc Wilcoxon comparisons confirm statistical significance. Feature-level analysis using Random Forest importance, SHAP attribution, and feature category isolation shows that covariance features achieve near-baseline standalone accuracy (94.96%), while eigenvalue features show limited standalone performance (84.07%) but provide compact complementary information. The compact architecture (3.33M parameters, approximately 13.3MB using 32-bit weights) suggests potential for future lightweight EEG-based affective computing, pending subject-independent and external validation.
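The size figure quoted in the abstract can be checked with simple arithmetic. The sketch below assumes decimal megabytes and assumes that the "train-test gap" means train accuracy minus test accuracy in percentage points; neither convention is stated in the abstract.

```python
# Figures quoted in the abstract above.
n_params = 3.33e6          # "3.33M parameters"
bytes_per_weight = 4       # 32-bit weights are 4 bytes each
size_mb = n_params * bytes_per_weight / 1e6   # decimal megabytes (assumed convention)

test_acc = 99.19           # held-out test accuracy, percent
gap = 0.81                 # train-test gap, percentage points
# Assumption: gap = train accuracy - test accuracy.
implied_train_acc = test_acc + gap

print(f"model size ~ {size_mb:.2f} MB")
print(f"implied train accuracy ~ {implied_train_acc:.2f}%")
```

Under these assumptions the size agrees with the quoted "approximately 13.3MB", and the gap would correspond to a training accuracy of about 100%.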

2601.21543 2026-06-10 cs.CL (version update)

inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu

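Affiliations: Beihang University

AI summary: Proposes the inversedMixup framework, which combines the controllability of Mixup with the interpretability of LLMs by aligning embedding spaces so that mixed embeddings can be reconstructed into readable sentences; provides the first empirical evidence of manifold intrusion in text Mixup; and extends the framework into a three-stage data augmentation method that is effective in few-shot and fully supervised settings.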

Abstract

Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup aligns the output embedding space of a task-specific model with the input embedding space of an LLM, so that mixed embeddings can be reconstructed, under a controllable mixing ratio, into human-interpretable sentences. This interpretability provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup. Building on this, we extend inversedMixup into a three-stage data augmentation method, and introduce a simple yet effective strategy to mitigate manifold intrusion during augmentation. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
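The interpolation step that the abstract attributes to standard Mixup can be sketched as follows. This covers only the generic linear interpolation of embeddings and labels, not the alignment or inversion stages of inversedMixup; the function name `mixup` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, lam):
    """Linearly interpolate two embeddings and their one-hot labels.

    lam is the mixing ratio in [0, 1]; lam = 1 returns sample a unchanged.
    """
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_emb, mixed_label

# Toy sentence embeddings and one-hot labels (hypothetical values).
emb_a, emb_b = np.array([1.0, 0.0, 2.0]), np.array([0.0, 4.0, 2.0])
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

mixed_emb, mixed_label = mixup(emb_a, emb_b, label_a, label_b, lam=0.7)
print(mixed_emb, mixed_label)
```

The mixed embedding is a point between two sentence representations with no text attached to it, which is the interpretability gap the abstract describes.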

2509.26000 2026-06-10 cs.LG stat.ML (version update)

Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access

Daniel Ebi, Damien Ernst, Klemens Böhm, Gaspard Lambrechts

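Affiliations: DeepMind; University of Cambridge; ETH Zurich; University of Geneva

AI summary: Proposes the informed asymmetric actor-critic framework, which lets the critic be conditioned on arbitrary state-dependent privileged signals, and proves that this yields unbiased policy gradient estimates; designs two informativeness criteria for selecting the best signals; experiments show that well-chosen signals can match or outperform full-state baselines.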

Comments Accepted at ICML 2026

Abstract

Asymmetric reinforcement learning leverages privileged information available during training to improve learning under partial observability. Existing asymmetric actor-critic methods typically assume access to the full environment state to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework that allows the critic to be conditioned on arbitrary state-dependent privileged signals, and show that any such signal yields unbiased policy gradient estimates. This substantially expands the set of admissible privileged information and raises the problem of selecting the most informative signals for learning. To this end, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a test based on improvements in value prediction that can be applied post hoc. Experiments on partially observable benchmarks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.

2508.21022 2026-06-10 cs.LG math.OC stat.ML (version update)

A Sketch-and-Project Analysis of Subsampled Natural Gradient Algorithms

Gil Goldshlager, Jiang Hu, Lin Lin

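Affiliations: University of California, Berkeley

AI summary: By viewing subsampled natural gradient descent (SNG) as a sketch-and-project method, proposes a new proxy based on squared volume sampling, proves that with a single mini-batch the expected SNG direction equals a preconditioned gradient descent step, gives global convergence guarantees and explicit convergence rates, and explains that SNG's advantage over SGD lies in exploiting spectral decay in the model Jacobian more effectively.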

Comments 26 pages, 7 figures

Abstract

Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
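For readers unfamiliar with sketch-and-project, a minimal generic instance for a consistent linear system is sketched below: each iteration samples a block of rows and projects the iterate onto the solution set of that block (a block Kaczmarz step). The sketch uses uniform row sampling and the Euclidean norm. It is not the SNG algorithm and not the squared-volume-sampling proxy analyzed in the paper, and the function name and problem sizes are assumptions.

```python
import numpy as np

def sketch_and_project(A, b, batch_size, iters, seed=0):
    """Solve a consistent system A x = b by repeated projection onto row blocks."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        # Sketch: keep a uniformly sampled mini-batch of equations.
        rows = rng.choice(n, size=batch_size, replace=False)
        A_s = A[rows]
        residual = A_s @ x - b[rows]
        # Project: nearest point (Euclidean) satisfying the sampled equations.
        x = x - np.linalg.pinv(A_s) @ residual
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
x_true = rng.standard_normal(10)
b = A @ x_true                      # consistent system by construction

x_hat = sketch_and_project(A, b, batch_size=5, iters=500)
print(np.linalg.norm(x_hat - x_true))
```

Each step uses only the sampled rows both to form the residual and to build the projection, which mirrors the coupling between gradient and preconditioner that the abstract says the usual two-mini-batch proxy removes.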

2602.04119 2026-06-10 cs.LG q-bio.QM (version update)

Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors

Hyeonah Kim, Minsu Kim, Celine Roget, Dionessa Biton, Louis Vaillancourt, Yves V. Brun, Yoshua Bengio, Alex Hernandez-Garcia

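Affiliations: University of Toronto; DeepMind; University of Montreal

AI summary: Proposes S3-GFN, a softly regularized sequence-based GFlowNet that uses chemical priors from large-scale SMILES corpora to generate high-reward, synthesizable molecules; experiments show a synthesizable rate of ≥95%.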

Abstract

The application of generative models for experimental drug discovery campaigns is severely limited by the difficulty of designing molecules de novo that can be synthesized in practice. Previous works have leveraged Generative Flow Networks (GFlowNets) to impose hard synthesizability constraints through the design of state and action spaces based on predefined reaction templates and building blocks. Despite the promising prospects of this approach, it currently lacks flexibility and scalability. As an alternative, we propose S3-GFN, which generates synthesizable SMILES molecules via simple soft regularization of a sequence-based GFlowNet. Our approach leverages rich molecular priors learned from large-scale SMILES corpora to steer molecular generation towards high-reward, synthesizable chemical spaces. The model induces constraints through off-policy replay training with a contrastive learning signal based on separate buffers of synthesizable and unsynthesizable samples. Our experiments show that S3-GFN learns to generate synthesizable molecules (≥95%) with higher rewards in diverse tasks.

2503.13358 2026-06-10 cs.CV (version update)

One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

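Affiliations: Kandinsky Lab; Mohamed bin Zayed University of Artificial Intelligence; Luzin Research Center; Moscow Independent Research Institute of Artificial Intelligence; Applied AI Institute

AI summary: Proposes RSD, a distillation method that trains a student network so that a fake ResShift model trained on its generated images coincides with the teacher, achieving single-step super-resolution that surpasses the teacher and SinSR on perceptual metrics, with a lower parameter count and computational cost.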

Comments ICML-2026

Abstract

Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods with limited computational costs in terms of perceptual quality. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, GPU memory, and training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K. We provide the code at https://github.com/Daniil-Selikhanovych/RSD.

2602.03164 2026-06-10 cs.LG cs.AI (version update)

MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning

Xiaoyu Tao, Mingyue Cheng, Ze Guo, Shuo Yu, Yaguo Liu, Qi Liu, Shijin Wang

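Affiliations: University of Science and Technology of China

AI summary: Proposes the MemCast framework, which recasts time series forecasting as an experience-conditioned reasoning task, learns historical patterns, reasoning wisdom, and general laws through a hierarchical memory, and uses a dynamic confidence adaptation strategy for continual evolution; it outperforms existing methods on multiple datasets.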

Abstract

Time series forecasting (TSF) plays a critical role in decision-making for many real-world applications. Recently, large language model (LLM)-based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning-to-memory framework that reformulates TSF as an experience-conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at https://github.com/Xiaoyu-Tao/MemCast-TS.

2602.02788 2026-06-10 cs.LG cs.AI physics.comp-ph (version update)

Structure-Preserving Learning Improves Geometry Generalization in Neural PDEs

Benjamin D. Shaffer, Shawn Koohy, Brooks Kinch, M. Ani Hsieh, Nathaniel Trask

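Affiliations: University of California, Berkeley

AI summary: Proposes General-Geometry Neural Whitney Forms (Geo-NeW), which jointly learns a differential operator and compatible reduced finite element spaces and uses Finite Element Exterior Calculus to preserve physical conservation laws exactly, substantially improving generalization to unseen geometries.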

Abstract

We aim to develop physics foundation models for science and engineering that provide real-time solutions to Partial Differential Equations (PDEs) which preserve structure and accuracy under adaptation to unseen geometries. To this end, we introduce General-Geometry Neural Whitney Forms (Geo-NeW): a data-driven finite element method. We jointly learn a differential operator and compatible reduced finite element spaces defined on the underlying geometry. The resulting model is solved to generate predictions, while exactly preserving physical conservation laws through Finite Element Exterior Calculus. Geometry enters the model as a discretized mesh both through a transformer-based encoding and as the basis for the learned finite element spaces. This explicitly connects the underlying geometry and imposed boundary conditions to the solution, providing a powerful inductive bias for learning neural PDEs, which we demonstrate improves generalization to unseen domains. We provide a novel parameterization of the constitutive model ensuring the existence and uniqueness of the solution. Our approach demonstrates state-of-the-art performance on several steady-state PDE benchmarks, and provides a significant improvement over conventional baselines on out-of-distribution geometries.

2602.01951 2026-06-10 cs.CV (version update)

Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network

Shuyang Wu, Yifu Qiu, Ines P Nearchou, Sandrine Prost, Jonathan A Fallowfield, Hakan Bilen, Timothy J Kendall

Affiliations: Institute for Regeneration and Repair, University of Edinburgh; School of Informatics, University of Edinburgh; Indica Labs, 8700 Education Pl NW, Bldg. B Albuquerque, US; Medical School, University of St Andrews

AI summary: Proposes the Multi-scale Pyramidal Network (MSPN), a plug-and-play module that performs progressive multi-scale whole-slide image analysis from a single high-magnification input, learning coarse context through grid-based remapping and a coarse guidance network; it consistently improves MIL performance across multiple tasks and frameworks.

Abstract

Multiple-instance Learning (MIL) is commonly used for computational pathology (CPath), where multi-scale features are essential for capturing both fine cellular details and broad tissue architecture. However, existing multi-scale MIL approaches typically rely on inflexible multi-magnification inputs or computationally expensive architectures. As pre-trained foundation models (FMs) become the trend for feature extraction and boost lightweight models, we rethink and explore a more efficient multi-scale MIL method. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), a plug-and-play module for attention-based MIL. MSPN introduces progressive multi-scale whole-slide image analysis using only a single high-magnification input. It consists of (1) grid-based remapping that aggregates high-magnification features to derive spatially-aware coarse feature maps, and (2) the Coarse Guidance Network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks on 5 clinically relevant tasks with 2 foundation models, and a pre-trained MIL framework. Our results demonstrate that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy to use.
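One way to picture the grid-based remapping step is as pooling patch features over coarser grid cells. The sketch below averages the features of all high-magnification patches that fall in the same cell; the abstract says only that features are "aggregated", so mean pooling, the function name `grid_remap`, and the integer patch-grid coordinates are all assumptions rather than the MSPN implementation.

```python
import numpy as np

def grid_remap(coords, feats, cell):
    """Average patch features within each cell x cell block of the patch grid.

    coords: (N, 2) integer patch positions on the high-magnification grid.
    feats:  (N, D) patch features from a frozen feature extractor.
    Returns the coarse cell indices and one pooled feature per occupied cell.
    """
    cells = coords // cell                               # coarse cell index per patch
    keys, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                        # flatten for all NumPy versions
    sums = np.zeros((len(keys), feats.shape[1]))
    np.add.at(sums, inverse, feats)                      # accumulate features per cell
    counts = np.bincount(inverse, minlength=len(keys))
    return keys, sums / counts[:, None]

coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2]])
feats = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])
coarse_cells, coarse_feats = grid_remap(coords, feats, cell=2)
print(coarse_cells, coarse_feats)
```

Because the coarse map is computed from features that already exist, no second pass over the slide at a lower magnification is needed, which matches the single-input design described in the abstract.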

2601.22763 2026-06-10 cs.CV (version update)

Is Task-Specific Training Necessary for Anomaly Detection?

Xingwu Zhang, Guanxuan Li, Paul Henderson, Gerardo Aragon-Camarasa, Zijun Long

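Affiliations: University of California, Berkeley

AI summary: Proposes RAD, a retrieval-based anomaly detection framework that needs no task-specific training and matches test features against anomaly-free features in a memory bank through multi-level retrieval; it achieves state-of-the-art performance on several benchmarks, challenging the need for task-specific training.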

Abstract

Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. However, we argue that such task-specific training is costly under distribution shifts, and that reconstruction-based residual scoring further faces a fidelity-stability dilemma. Existing training-free alternatives, in turn, remain prone to cross-category and cross-region mismatches in MUAD. Motivated by these limitations, we propose Retrieval-based Anomaly Detection (RAD), a framework free of task-specific training that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image, compared to 98.5% in its full-data setting. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with training-free memory-based retrieval. Our code is available at https://github.com/longkukuhi/RAD.
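The basic idea of scoring by retrieval from a memory of anomaly-free features can be sketched with a single-level nearest-neighbor lookup. This is a simplification and not RAD's multi-level retrieval; the function name `anomaly_scores` and the toy two-dimensional features are assumptions for illustration.

```python
import numpy as np

def anomaly_scores(test_feats, memory):
    """Score each test patch by its distance to the nearest memory feature.

    test_feats: (T, D) features of test patches.
    memory:     (M, D) features collected from anomaly-free images.
    """
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2.
    d2 = (
        (test_feats ** 2).sum(axis=1)[:, None]
        - 2.0 * test_feats @ memory.T
        + (memory ** 2).sum(axis=1)[None, :]
    )
    d2 = np.maximum(d2, 0.0)            # guard against small negative round-off
    return np.sqrt(d2.min(axis=1))      # distance to the closest normal feature

memory = np.array([[0.0, 0.0], [1.0, 0.0]])      # anomaly-free patch features
test_feats = np.array([[0.0, 0.0], [4.0, 4.0]])  # one normal-looking, one unusual patch
scores = anomaly_scores(test_feats, memory)
print(scores)
```

No parameters are fitted here: adding a new product category only means adding its anomaly-free features to the memory, which is the property the abstract contrasts with reconstruction-based training.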

2602.01179 2026-06-10 cs.LG (version update)

Rethinking the Flow-Based Gradual Domain Adaptation: A Semi-Dual Optimal Transport Perspective

Zhichao Chen, Zhan Zhuang, Yunfei Teng, Hao Wang, Fangyikang Wang, Zhengnan Li, Tianqiao Liu, Haoxuan Li, Zhouchen Lin

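Affiliations: University of Science and Technology of China

AI summary: To address the construction of intermediate domains in gradual domain adaptation, proposes an entropy-regularized semi-dual unbalanced optimal transport framework that avoids likelihood estimation, enabling stable training and improved performance.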

Comments The paper has been accepted for presentation as a regular paper at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Gradual domain adaptation (GDA) aims to mitigate domain shift by progressively adapting models from the source domain to the target domain via intermediate domains. However, real intermediate domains are often unavailable or ineffective, necessitating the synthesis of intermediate samples. Flow-based models have recently been used for this purpose by interpolating between source and target distributions. Notably, their training typically relies on sample-based log-likelihood estimation, which can discard useful information and thus degrade GDA performance. The key to addressing this limitation is constructing the intermediate domains via samples directly. To this end, we propose an Entropy-regularized Semi-dual Unbalanced Optimal Transport (E-SUOT) framework to construct intermediate domains. Specifically, we reformulate flow-based GDA as a Lagrangian dual problem and derive an equivalent semi-dual objective that circumvents the need for likelihood estimation. However, the dual problem leads to an unstable min-max training procedure. To alleviate this issue, we further introduce the entropy regularization to convert it into a more stable sequential optimization procedure. Based on this, we propose a novel GDA training framework and provide theoretical analysis in terms of stability and generalization. Finally, extensive experiments are conducted to demonstrate the efficacy of the E-SUOT framework.

2512.06343 2026-06-10 cs.LG cs.AI cs.CL (version update)

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh

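Affiliations: University of California, Berkeley

AI summary: Analyzes the gradient bias caused by representation distance in the BT loss and proposes NormBT, an adaptive normalization scheme that improves reward model performance on fine-grained distinctions.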

Comments ICML 2026

Abstract

Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of the BT loss and show that it carries spurious learning signals due to representation distance. In particular, the BT gradient norm scales with two distinct components: (1) prediction error, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales updates to balance representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in modification to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.
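The two-factor structure of the BT gradient can be seen in the simplest case of a linear reward head, r(h) = w·h, where the gradient of -log σ(r_chosen - r_rejected) with respect to w is -(1 - σ(margin)) (h_chosen - h_rejected). The sketch below assumes that linear head and uses made-up two-dimensional representations; it illustrates the scaling only and does not implement NormBT.

```python
import numpy as np

def bt_grad_wrt_head(w, h_chosen, h_rejected):
    """Gradient of the BT loss -log(sigmoid(w.h_c - w.h_r)) with respect to w."""
    diff = h_chosen - h_rejected                 # representation difference of the pair
    margin = w @ diff                            # predicted reward gap
    error = 1.0 / (1.0 + np.exp(margin))         # 1 - sigmoid(margin): prediction error
    return -error * diff                         # norm = error * ||diff||

w = np.array([1.0, 0.0])                         # hypothetical linear reward head
h_c = np.zeros(2)                                # chosen response representation
near_rejected = np.array([0.1, 0.0])             # close to the chosen response
far_rejected = np.array([0.1, -5.0])             # same reward gap, far in representation

g_near = bt_grad_wrt_head(w, h_c, near_rejected)
g_far = bt_grad_wrt_head(w, h_c, far_rejected)
ratio = np.linalg.norm(g_far) / np.linalg.norm(g_near)
print(f"gradient norm ratio (far / near): {ratio:.2f}")
```

Both pairs are misranked by the same margin, so their prediction errors are identical, yet the far pair receives a gradient roughly fifty times larger purely because of its representation distance.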

2512.08280 2026-06-10 cs.RO cs.AI cs.SY eess.SY (version update)

Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making

Haldun Balim, Na Li, Yilun Du

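Affiliations: GitHub

AI summary: Proposes the MPDiffuser framework, which combines a diffusion planner with a dynamics diffusion model and alternates their updates during sampling to generate trajectories that meet task objectives and are dynamically feasible, then uses a lightweight ranking module to select the best trajectory; validated on the D4RL and DSRL benchmarks and on a quadrupedal robot.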

Abstract

Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.

2601.21218 2026-06-10 cs.CL (version update)

Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

Christopher Adrian Kusuma, Muhammad Reza Qorib, Hwee Tou Ng

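Affiliations: Department of Computer Science, National University of Singapore

AI summary: To address hallucination by large language models when their knowledge is insufficient, proposes a more robust honesty evaluation benchmark built from publicly available pretraining data, and designs a method that retrieves pretraining data to improve model honesty.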

Comments Findings of ACL 2026

Abstract

Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don't know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.

2601.19210 2026-06-10 cs.CV (version update)

Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP

对比谱校正:面向CLIP零样本对抗鲁棒性的测试时防御

Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出对比谱校正(CSR)方法,利用对抗样本在频率衰减下的特征不一致性,通过谱引导对比目标优化校正扰动,在16个分类基准上平均提升18.1%的强攻击鲁棒性,且推理开销低。

Comments Accepted by ICML 2026

AI中文摘要

视觉语言模型(如CLIP)展现出显著的零样本泛化能力,但仍极易受到对抗样本的攻击。尽管测试时防御方法前景广阔,现有方法无法对强攻击提供足够的鲁棒性,且常受限于高推理延迟和任务特定适用性。为解决这些限制,我们首先研究了对抗样本的内在特性,发现对抗样本在渐进频率衰减下表现出严重的特征不一致性。我们进一步将其归因于模型固有的谱偏差。利用这一洞察,我们提出了一种高效的测试时防御方法,名为对比谱校正(CSR)。CSR优化一个校正扰动,在谱引导的对比目标下将输入重新对齐到自然流形,并以输入自适应方式应用。在16个分类基准上的大量实验表明,CSR在强APGD攻击下平均优于现有技术18.1%,且推理开销适中。此外,CSR在多种视觉任务中展现出广泛的适用性。代码见 https://github.com/Summu77/CSR 。

英文摘要

Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong APGD with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.

2601.18765 2026-06-10 cs.RO 版本更新

Goal-oriented Communication for Fast and Robust Robotic Fault Detection and Recovery

面向快速鲁棒机器人故障检测与恢复的目标导向通信

Shutong Chen, Adnan Aijaz, Yansha Deng

发表机构 * Department of Engineering, King’s College London(伦敦国王学院工程系) Bristol Research and Innovation Laboratory, Toshiba Europe Ltd.(东芝欧洲有限公司布里斯托尔研究与创新实验室)

AI总结 提出目标导向通信框架,通过联合设计通信-计算-控制回路,利用3D场景图检测故障,并微调小语言模型结合知识蒸馏生成恢复动作,实现故障检测与恢复时间降低82.6%,任务成功率提升76%。

Comments Submitted to IEEE for potential publication

AI中文摘要

自主机器人系统广泛部署于智能工厂,并在动态、不确定及有人参与的环境中运行,需要低延迟且鲁棒的故障检测与恢复(FDR)。然而,现有FDR框架存在各种局限性,例如通信和计算的显著延迟,以及机器人运动/轨迹生成的不可靠性,这主要是因为通信-计算-控制(3C)回路的设计未考虑下游FDR目标。为了解决这个问题,我们提出了一种新颖的目标导向通信(GoC)框架,该框架联合设计3C回路,专门用于快速鲁棒的机器人FDR,目标是最小化FDR时间同时最大化机器人任务(例如工件分拣)成功率。对于故障检测,我们的GoC框架创新性地通过我们设计的表示提取器定义并提取3D场景图(3D-SG)作为语义表示,并通过监测3D-SG中的空间关系变化来检测故障。对于故障恢复,我们通过低秩适配(LoRA)微调一个小语言模型(SLM),并通过知识蒸馏增强其推理和泛化能力,以生成机器人的恢复动作。我们还设计了一个轻量级的目标导向数字孪生重建模块,在需要精细机器人控制时,仅使用任务相关的物体轮廓进行数字孪生重建,以优化SLM生成的恢复动作。大量仿真表明,与依赖视觉语言模型进行故障检测和大型语言模型进行故障恢复的最先进框架相比,我们的GoC框架将FDR时间降低了高达82.6%,并将任务成功率提高了高达76%。

英文摘要

Autonomous robotic systems are widely deployed in smart factories and operate in dynamic, uncertain, and human-involved environments that require low-latency and robust fault detection and recovery (FDR). However, existing FDR frameworks exhibit various limitations, such as significant delays in communication and computation, and unreliability in robot motion/trajectory generation, mainly because the communication-computation-control (3C) loop is designed without considering the downstream FDR goal. To address this, we propose a novel Goal-oriented Communication (GoC) framework that jointly designs the 3C loop tailored for fast and robust robotic FDR, with the goal of minimising the FDR time while maximising the robotic task (e.g., workpiece sorting) success rate. For fault detection, our GoC framework innovatively defines and extracts the 3D scene graph (3D-SG) as the semantic representation via our designed representation extractor, and detects faults by monitoring spatial relationship changes in the 3D-SG. For fault recovery, we fine-tune a small language model (SLM) via Low-Rank Adaptation (LoRA) and enhance its reasoning and generalization capabilities via knowledge distillation to generate recovery motions for robots. We also design a lightweight goal-oriented digital twin reconstruction module to refine the recovery motions generated by the SLM when fine-grained robotic control is required, using only task-relevant object contours for digital twin reconstruction. Extensive simulations demonstrate that our GoC framework reduces the FDR time by up to 82.6% and improves the task success rate by up to 76%, compared to the state-of-the-art frameworks that rely on vision language models for fault detection and large language models for fault recovery.

2601.18026 2026-06-10 cs.CL 版本更新

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

CommonLID:重新评估网络数据上最先进的语言识别性能

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate, Shamsuddeen Hassan Muhammad, Atnafu Lambebo Tonja, Hend Al-Khalifa, Nadia Ghezaiel Hammouda, Verrah Otiende, Tack Hwa Wong, Jakhongir Saydaliev, Melika Nobakhtian, Muhammad Ravi Shulthan Habibi, Chalamalasetti Kranti, Carol Muchemi, Khang Nguyen, Faisal Muhammad Adam, Luis Frentzen Salim, Reem Alqifari, Cynthia Amol, Joseph Marvin Imperial, Ilker Kesen, Ahmad Mustafid, Pavel Stepachev, Leshem Choshen, David Anugraha, Hamada Nayel, Seid Muhie Yimam, Vallerie Alexandra Putra, My Chiffon Nguyen, Azmine Toushik Wasi, Gouthami Vadithya, Rob van der Goot, Lanwenn ar C'horr, Karan Dua, Andrew Yates, Mithil Bangera, Yeshil Bangera, Hitesh Laxmichand Patel, Shu Okabe, Fenal Ashokbhai Ilasariya, Dmitry Gaynullin, Genta Indra Winata, Yiyuan Li, Juan Pablo Martínez, Amit Agarwal, Ikhlasul Akmal Hanif, Raia Abu Ahmad, Esther Adenuga, Filbert Aurelian Tjiaranata, Weerayut Buaphet, Michael Anugraha, Sowmya Vajjala, Benjamin Rice, Azril Hafizi Amirudin, Jesujoba O. Alabi, Srikant Panda, Yassine Toughrai, Bruhan Kyomuhendo, Daniel Ruffinelli, Akshata A, Manuel Goulão, Ej Zhou, Ingrid Gabriela Franco Ramirez, Cristina Aggazzotti, Konstantin Dobler, Jun Kevin, Quentin Pagès, Nicholas Andrews, Nuhu Ibrahim, Mattes Ruckdeschel, Amr Keleg, Mike Zhang, Casper Muziri, Saron Samuel, Sotaro Takeshita, Kun Kerdthaisong, Luca Foppiano, Rasul Dent, Tommaso Green, Ahmad Mustapha Wali, Kamohelo Makaaka, Vicky Feliren, Inshirah Idris, Hande Celikkanat, Abdulhamid Abubakar, Jean Maillard, Benoît Sagot, Thibault Clérice, Kenton Murray, Sarah Luger

发表机构 * Common Crawl Foundation(Common Crawl基金会) EleutherAI Factored AI MLCommons

AI总结 提出CommonLID基准,覆盖109种语言,通过人工标注评估8种主流LID模型,揭示现有评估高估了网络领域多语言识别准确率。

Comments 18 pages, 8 tables, 5 figures

AI中文摘要

语言识别(LID)是整理多语言语料库的基本步骤。然而,LID模型在许多语言上仍然表现不佳,尤其是在用于训练多语言语言模型的嘈杂且异构的网络数据上。在本文中,我们介绍了CommonLID,一个社区驱动、人工标注的网络领域LID基准,涵盖109种语言。其中许多语言此前未得到充分服务,使得CommonLID成为开发更具代表性的高质量文本语料库的关键资源。我们通过使用CommonLID以及其他五个常见的评估集来测试八种流行的LID模型,展示了其价值。我们分析结果以定位我们的贡献,并提供对当前技术水平的概述。我们特别强调,现有评估高估了网络领域许多语言的LID准确率。我们以开放、宽松的许可证提供CommonLID和用于创建它的代码。

英文摘要

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

2512.02201 2026-06-10 cs.CL 版本更新

Swivuriso: The South African Next Voices Multilingual Speech Dataset

Swivuriso:南非下一代语音多语言语音数据集

Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga

发表机构 * University of Cape Town(开普敦大学) University of KwaZulu-Natal(夸祖鲁-纳塔尔大学)

AI总结 介绍Swivuriso,一个3000小时的多语言语音数据集,覆盖南非七种语言,用于自动语音识别技术的开发与基准测试,填补现有数据集空白。

Comments Work in Progress. Updated in June 2026

AI中文摘要

本文介绍了Swivuriso,一个3000小时的多语言语音数据集,作为非洲下一代语音项目的一部分开发,旨在支持七种南非语言的自动语音识别(ASR)技术的开发和基准测试。涵盖农业、医疗保健和通用领域主题,Swivuriso填补了现有ASR数据集的重大空白。我们描述了指导数据集创建的设计原则、伦理考虑和数据收集程序。我们展示了使用这些数据训练/微调ASR模型的基线结果,并与相关语言的其他ASR数据集进行了比较。

英文摘要

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare them to results on other ASR datasets for the languages concerned.

2402.00152 2026-06-10 cs.LG cs.NA math.NA stat.ML 版本更新

Deeper or Wider: A Perspective from Optimal Generalization Error with Sobolev Loss

更深还是更宽:从Sobolev损失下最优泛化误差的视角

Yahong Yang, Juncai He


AI总结 本文通过分析Sobolev损失下的最优泛化误差,比较了深层网络与宽层网络,揭示了样本点数量、网络参数和损失函数正则性对架构选择的影响,并应用于深度Ritz和PINN方法。

Comments arXiv admin note: text overlap with arXiv:2310.10766, arXiv:2305.08466

AI中文摘要

构建神经网络的架构是机器学习社区的一个具有挑战性的追求,而更深还是更宽的困境仍然是一个持续存在的问题。本文探讨了具有灵活层数的深层神经网络(DeNNs)与具有有限隐藏层的宽神经网络(WeNNs)之间的比较,重点关注它们在Sobolev损失下的最优泛化误差。分析研究表明,神经网络的架构可能受到多种因素的显著影响,包括样本点数量、神经网络内的参数以及损失函数的正则性。具体来说,更多的参数倾向于有利于WeNNs,而增加的样本点数量和损失函数的更大正则性则倾向于采用DeNNs。我们最终将该理论应用于使用深度Ritz和物理信息神经网络(PINN)方法求解偏微分方程,指导神经网络的设计。

英文摘要

Constructing the architecture of a neural network is a challenging pursuit for the machine learning community, and the dilemma of whether to go deeper or wider remains a persistent question. This paper explores a comparison between deeper neural networks (DeNNs) with a flexible number of layers and wider neural networks (WeNNs) with limited hidden layers, focusing on their optimal generalization error in Sobolev losses. Analytical investigations reveal that the architecture of a neural network can be significantly influenced by various factors, including the number of sample points, parameters within the neural networks, and the regularity of the loss function. Specifically, a higher number of parameters tends to favor WeNNs, while an increased number of sample points and greater regularity in the loss function lean towards the adoption of DeNNs. We ultimately apply this theory to address partial differential equations using deep Ritz and physics-informed neural network (PINN) methods, guiding the design of neural networks.

2601.08379 2026-06-10 cs.LG cs.AI cs.CV 版本更新

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

MMD Guidance: 基于最大均值差异引导的无训练分布适应扩散模型

Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MMD Guidance,一种无训练方法,通过最大均值差异梯度引导扩散模型采样,实现与参考数据分布对齐,无需重新训练。

AI中文摘要

预训练扩散模型已成为无条件及条件样本生成的有力先验,但其输出常偏离用户特定目标数据的特征。这种不匹配在领域适应任务中尤为突出,此时仅有少量参考样本可用且重新训练扩散模型不可行。现有推理时引导方法可调整采样轨迹,但通常优化替代目标(如分类器似然)而非直接对齐目标分布。我们提出MMD Guidance,一种无训练机制,通过生成样本与参考数据集之间的最大均值差异(MMD)梯度增强反向扩散过程。MMD能从有限数据中提供可靠分布估计,实践中方差低,且可高效微分,特别适合引导任务。我们的框架通过乘积核自然扩展到条件生成模型中的提示感知适应。此外,由于引导在潜在扩散模型(LDM)的潜在空间中进行,因此可高效计算。在合成及真实世界基准上的实验表明,MMD Guidance能在保持样本保真度的同时实现分布对齐。项目代码见 github.com/matinamehdizadeh/MMD-Guidance 。

英文摘要

Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose \emph{MMD Guidance}, a training-free mechanism that augments the reverse diffusion process with gradients of the \textit{Maximum Mean Discrepancy (MMD)} between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. The project code is available at github.com/matinamehdizadeh/MMD-Guidance.
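
The core quantity in the abstract above, the gradient of the MMD between generated samples and a reference set, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the Gaussian RBF kernel, the biased (V-statistic) estimator, the bandwidth `sigma`, and the hypothetical `guided_update` step with its `scale` parameter. In an actual sampler the gradient would be combined with the reverse-diffusion update rather than applied on its own.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean()
            + rbf_kernel(y, y, sigma).mean())


def mmd2_grad(x, y, sigma=1.0):
    """Analytic gradient of mmd2 with respect to each row of x."""
    n, m = x.shape[0], y.shape[0]
    kxx = rbf_kernel(x, x, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Row i holds sum_b k(x_i, x_b) (x_i - x_b) and sum_j k(x_i, y_j) (x_i - y_j).
    pull_xx = kxx.sum(axis=1, keepdims=True) * x - kxx @ x
    pull_xy = kxy.sum(axis=1, keepdims=True) * x - kxy @ y
    return (-2.0 / (n * n) * pull_xx + 2.0 / (n * m) * pull_xy) / sigma**2


def guided_update(x, reference, scale, sigma=1.0):
    """Hypothetical guidance step: move the samples down the MMD gradient."""
    return x - scale * mmd2_grad(x, reference, sigma)
```

For a small enough `scale`, one call to `guided_update` lowers `mmd2` between the batch and the reference set, which is the behaviour a guidance term of this kind relies on.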

2601.05232 2026-06-10 cs.CL cs.CY cs.LG 版本更新

AI Application Gives Users Real-Time Feedback on the Level of Peace in the Social Media Videos They Watch

AI应用为用户观看的社交媒体视频提供实时和平水平反馈

P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter

发表机构 * Data Science Institute, Columbia University(哥伦比亚大学数据科学研究所) Advanced Consortium on Cooperation, Conflict, and Complexity, Columbia University(哥伦比亚大学合作、冲突与复杂性高级联合体) Computer Science, Columbia University(哥伦比亚大学计算机科学) Data Science, St John’s University(圣约翰大学数据科学) Quantitative Methods in the Social Sciences, Columbia University(哥伦比亚大学社会科学定量方法) Barnard College, Columbia University(哥伦比亚大学巴纳德学院) Teachers College, Columbia University(哥伦比亚大学教师学院) Department of Industrial Engineering and Operations Research, Columbia University(哥伦比亚大学工业工程与运筹学系) Harmonious Communities, Toyota Research Institute(丰田研究院和谐社区)

AI总结 开发了一个实时分析YouTube视频中语言和平程度的AI应用,使用监督学习和大语言模型,大语言模型在测量和平相关社会维度上更接近人类编码者。

Comments 6 pages, 4 figures, corrected typos, minor edits; v3: 16 pages, improved title, abstract, introduction, discussion, conclusions, added more references

AI中文摘要

现在大多数人通过社交媒体(如YouTube和Facebook)上的视频获取新闻,而不是通过精心策划的新闻业。“我们成为我们所注视的。”语言的内容和语调在开始或结束冲突中起着至关重要的作用。“仇恨言论”会加剧冲突,“和平言论”会促进和平。我们开发了一个应用程序,可以实时测量YouTube视频中这些方面的言论,从而为用户提供关于自身媒体消费的有用反馈。我们使用了两种方法:1)监督机器学习。在线新闻媒体文本中的语言通过衡量这些国家和平水平的调查进行标记。一个全连接前馈网络和两个卷积神经网络在该数据上训练,在测试集上预测和平水平的准确率约为97%,在另一个不同的新闻文本数据集中准确率约为70%,但未能泛化到YouTube视频,表明书面文本与转录的口语不同。2)社会科学维度。没有类似的外部数据来标记YouTube视频转录文本中的语言。因此,我们使用了2个词级情感分析(SA)和6个上下文级大语言模型(LLM)来测量59项社会科学研究确定的和平中的5个社会维度:同情-蔑视、新闻-观点、促进-预防、创造力-秩序、细微差别-简化。在52个视频上,LLM与3个人类编码者的值更接近(r^2~0.60),而SA的r^2~0.03。结果:与人类编码者相比,LLM成功测量了YouTube视频中与和平相关的重要社会维度。这些结果构成了一个分析引擎的基础,该引擎可以为用户和内容创作者提供关于自身媒体消费和创作的反馈。

英文摘要

Most people now get their news from videos on social media, such as YouTube and Facebook, rather than through curated journalism. "We become what we behold." The content and tone of language play an essential role in starting or ending conflicts. "Hate Speech" can enhance conflict; "Peace Speech" can enhance peace. We developed an application that measures, in real time, these aspects of speech from YouTube videos, which can give users helpful feedback on their own media diet. We used two approaches: 1) supervised machine learning. Language in online news media text was tagged by surveys that measure the level of peace in those countries. One fully connected feedforward and 2 convolutional neural networks trained on that data were $\sim 97\%$ accurate in predicting levels of peace in the test set and $\sim 70\%$ accurate in another distinct news text data set, but did not generalize to YouTube videos, suggesting that written text is different from transcribed spoken language. 2) social science dimensions. There is no similar external data to tag the text in the YouTube video transcripts. We therefore used 2 word-level sentiment analysis (SA) methods and 6 context-level large language models (LLMs) to measure 5 social dimensions in peace identified by 59 social science studies: compassion-contempt, news-opinion, promotion-prevention, creativity-order, nuance-simplification. On 52 videos, LLMs matched the values assigned by 3 human coders more closely ($r^2\sim0.60$) than SA did ($r^2\sim0.03$). Results: LLMs successfully measured social dimensions important in peace in YouTube videos, compared to human coders. These results serve as the basis of an analysis engine that can give users and content creators feedback on their own media diet and creations.

2601.06997 2026-06-10 cs.RO cs.CV 版本更新

ObjSplat: Geometry-Aware Gaussian Surfels for Active Object Reconstruction

ObjSplat: 几何感知的高斯面元用于主动物体重建

Yuetao Li, Zhizhou Jia, Yu Zhang, Qun Hao, Shaohui Zhang

发表机构 * School of Optics and Photonics, Beijing Institute of Technology(光学与光子学学院,北京理工大学) School of Optoelectronic Engineering, Changchun University of Science and Technology(光电工程学院,长春理工大学)

AI总结 提出ObjSplat框架,利用高斯面元统一表示,通过几何感知视点评估和下一最佳路径规划器,实现高效高保真的主动物体重建。

Comments Accepted to IEEE T-ASE. Code: https://github.com/Li-Yuetao/ObjSplat , Project Page: https://li-yuetao.github.io/ObjSplat-page/

AI中文摘要

自主高保真物体重建是创建数字资产和弥合机器人模拟与现实差距的基础。我们提出ObjSplat,一个主动重建框架,利用高斯面元作为统一表示,逐步重建未知物体,同时具有逼真的外观和准确的几何。针对传统基于不透明度或深度线索的局限性,我们引入了几何感知视点评估管线,明确建模背面可见性和遮挡感知的多视图共视性,即使在几何复杂的物体上也能可靠地识别未重建区域。此外,为了克服贪婪规划策略的局限性,ObjSplat采用下一最佳路径(NBP)规划器,在动态构建的空间图上执行多步前瞻。通过联合优化信息增益和移动成本,该规划器生成全局高效的轨迹。在仿真和真实世界文化遗物上的大量实验表明,ObjSplat在几分钟内生成物理一致的模型,与最先进方法相比,实现了卓越的重建保真度和表面完整性,同时显著减少了扫描时间和路径长度。项目页面:https://li-yuetao.github.io/ObjSplat-page/ 。

英文摘要

Autonomous high-fidelity object reconstruction is fundamental for creating digital assets and bridging the simulation-to-reality gap in robotics. We present ObjSplat, an active reconstruction framework that leverages Gaussian surfels as a unified representation to progressively reconstruct unknown objects with both photorealistic appearance and accurate geometry. Addressing the limitations of conventional opacity or depth-based cues, we introduce a geometry-aware viewpoint evaluation pipeline that explicitly models back-face visibility and occlusion-aware multi-view covisibility, reliably identifying under-reconstructed regions even on geometrically complex objects. Furthermore, to overcome the limitations of greedy planning strategies, ObjSplat employs a next-best-path (NBP) planner that performs multi-step lookahead on a dynamically constructed spatial graph. By jointly optimizing information gain and movement cost, this planner generates globally efficient trajectories. Extensive experiments in simulation and on real-world cultural artifacts demonstrate that ObjSplat produces physically consistent models within minutes, achieving superior reconstruction fidelity and surface completeness while significantly reducing scan time and path length compared to state-of-the-art approaches. Project page: https://li-yuetao.github.io/ObjSplat-page/ .

2512.17629 2026-06-10 cs.LG cs.AI 版本更新

SCOPE: Sequential Causal Optimization of Process Interventions

SCOPE: 过程干预的顺序因果优化

Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt

发表机构 * Research Centre for Information Systems Engineering (LIRIS), KU Leuven, Leuven, Belgium(信息系统工程研究中心(LIRIS),鲁汶大学,比利时鲁汶) School of Computation, Information and Technology, Technical University of Munich (TUM), Munich, Germany(计算、信息与技术学院,慕尼黑工业大学(TUM),德国慕尼黑)

AI总结 提出SCOPE方法,通过反向归纳和因果学习直接利用观测数据,优化业务流程中顺序干预的KPI,优于现有方法。

AI中文摘要

规范性过程监控(PresPM)在运行业务流程期间推荐干预措施以优化关键绩效指标(KPI)。在现实环境中,干预很少是孤立的:组织需要对齐干预序列以共同引导案例的结果。现有的PresPM方法仅部分解决了这一挑战。许多方法专注于单个干预决策,而其他方法将多个干预视为独立,忽略了它们随时间如何相互作用。确实处理这些依赖关系的方法依赖于模拟或数据增强来近似过程以训练强化学习(RL)代理,这可能会造成现实差距并引入偏差。我们提出了SCOPE(过程干预的顺序因果优化),一种学习对齐的顺序干预推荐的PresPM方法。SCOPE采用反向归纳来估计每个候选干预动作的效果,将其影响从最终决策点传播回第一个决策点。通过利用因果学习器,我们的方法可以直接使用观测数据,不同于需要构建过程近似用于RL的方法。在现有合成数据集和新的半合成数据集上的实验表明,SCOPE在优化KPI方面始终优于最先进的PresPM技术。基于真实事件日志的新型半合成设置作为可重复使用的基准,用于未来关于顺序PresPM的工作。

英文摘要

Prescriptive Process Monitoring (PresPM) recommends interventions during running business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches only partially address this challenge. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which may create a reality gap and introduce bias. We introduce SCOPE (Sequential Causal Optimization of Process Interventions), a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for RL. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.
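
The backward-induction step mentioned in the abstract above can be illustrated with a toy tabular sketch. This is plain finite-horizon dynamic programming over known tables, not SCOPE itself: the abstract says the method estimates intervention effects with causal learners from observational data, whereas here the final-step outcome table `final_q` and the transition tables `transitions` are simply given. Both names, the discrete state encoding, and the numbers in the usage note are hypothetical.

```python
import numpy as np


def backward_induction(final_q, transitions):
    """Propagate expected KPI values from the last decision point back to the first.

    final_q:     array (S, A), expected KPI for taking action a in state s
                 at the final decision point.
    transitions: list of arrays (S, A, S), one per earlier decision point in
                 chronological order; entry [s, a, s2] is P(next state s2 | s, a).
    Returns (policy, values): policy[t][s] is the best action at decision point t,
    and values[s] is the expected KPI from the first decision point when acting
    optimally afterwards.
    """
    q = np.asarray(final_q, dtype=float)
    policy = [q.argmax(axis=1)]
    values = q.max(axis=1)
    for p in reversed(transitions):
        # Expected downstream value of each (state, action) pair at this point.
        q = np.asarray(p, dtype=float) @ values
        policy.insert(0, q.argmax(axis=1))
        values = q.max(axis=1)
    return policy, values
```

With two states, two actions and two decision points, for example `final_q = [[1.0, 3.0], [2.0, 0.5]]` and a single transition table `[[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]`, the first decision in state 0 is scored by what the best follow-up intervention can still achieve, rather than in isolation.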

2601.04776 2026-06-10 cs.CV 版本更新

Segmentation-Driven Monocular Shape from Polarization based on Physical Model

基于物理模型的分割驱动单目光学偏振形状恢复

Jinyu Zhang, Xu Ma, Weili Chen

发表机构 * Key Laboratory of Photoelectronic Imaging Technology and System of Ministry of Education of China, School of Optics and Photonics, Beijing Institute of Technology(中国教育部光电成像技术与系统重点实验室,光学与光子学学院,北京理工大学) National Key Laboratory of Scattering and Radiation, Beijing Institute of Environmental Features(散射与辐射国家重点实验室,北京环境特征研究院)

AI总结 提出分割驱动单目光学偏振形状恢复框架,通过偏振辅助自适应区域生长分割凸子区域并引入多尺度融合凸性先验约束,有效解决方位角歧义,提升重建精度与几何保真度。

Comments 23 pages, 10 figures, submitted to Elsevier Pattern Recognition

AI中文摘要

单目光学偏振形状恢复(SfP)利用光偏振特性与表面几何之间的内在关系,从单视角偏振图像中恢复表面法线,为三维(3D)重建提供了一种紧凑且稳健的方法。尽管具有潜力,现有的单目SfP方法受到方位角歧义(偏振分析的固有限制)的影响,严重损害了重建的准确性和稳定性。本文提出了一种新颖的分割驱动单目SfP(SMSfP)框架,将全局形状恢复重新表述为在自适应分割的凸子区域上的一组局部重建。具体而言,提出了一种偏振辅助自适应区域生长(PARG)分割策略,将全局凸性假设分解为局部凸区域,有效抑制方位角歧义并保持表面连续性。此外,开发了一种多尺度融合凸性先验(MFCP)约束,以确保局部表面一致性并增强精细纹理和结构细节的恢复。在合成和真实世界数据集上的大量实验验证了所提出的方法,与现有的基于物理的单目SfP技术相比,在消歧准确性和几何保真度方面显示出显著改进。

英文摘要

Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.

2510.04514 2026-06-10 cs.AI cs.CE cs.CL cs.CV stat.ME 版本更新

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent: 一种用于复杂图表问答中视觉基础推理的多模态智能体

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

发表机构 * J.P. Morgan AI Research(摩根大通人工智能研究)

AI总结 提出ChartAgent框架,通过迭代分解查询为视觉子任务并利用图表专用视觉工具(如绘制注释、裁剪区域)进行空间域推理,在ChartBench和ChartX上取得最先进性能,尤其对无标注图表提升显著。

Comments Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)

AI中文摘要

最近的多模态大语言模型在基于图表的视觉问答中显示出潜力,但在无标注图表上——即那些需要精确视觉解释而非依赖文本捷径的图表——其性能急剧下降。为了解决这个问题,我们引入了ChartAgent,一种新颖的智能体框架,它直接在图表的空间域内显式执行视觉推理。与文本思维链推理不同,ChartAgent通过专门的行动(如绘制注释、裁剪区域(例如分割饼图切片、隔离条形图)和定位坐标轴)迭代地将查询分解为视觉子任务,并主动操作和交互图表图像,使用图表专用视觉工具库来完成每个子任务。这种迭代推理过程密切模仿了人类理解图表的认知策略。ChartAgent在ChartBench和ChartX基准测试上达到了最先进的准确率,整体上比先前方法绝对提升高达16.07%,在无标注、数值密集的查询上提升17.31%。此外,我们的分析表明,ChartAgent (a) 在多种图表类型上有效,(b) 在不同视觉和推理复杂度水平上均取得最高分数,(c) 作为一个即插即用的框架,提升了多种基础LLM的性能。我们的工作是首批使用工具增强的多模态智能体展示图表理解中视觉基础推理的工作之一。

英文摘要

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

2601.03093 2026-06-10 cs.LG cs.CL 版本更新

ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning

ATLAS:验证器引导的自适应潜在激活引导用于高效LLM推理

Tuc Nguyen, Thai Le

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 提出ATLAS框架,通过轻量级验证器动态调整推理时潜在状态引导策略,实现每步自适应控制,在数学和编码推理任务上提升准确率并减少测试时token使用。

Comments 21 pages, 6 figures

AI中文摘要

最近关于激活和潜在引导的研究表明,修改内部表示可以有效引导大型语言模型(LLMs)在不更新模型参数的情况下提高推理和效率。然而,大多数现有方法依赖固定引导策略和静态干预强度,这限制了它们在问题实例上的鲁棒性,并常常导致过度或不足引导。我们提出自适应测试时潜在引导(ATLAS),这是一个轻量级框架,通过训练好的、轻量级验证器在推理时动态控制引导决策。给定中间隐藏状态,验证器预测当前推理的质量,并自适应选择要应用的引导动作,实现每个示例和每个步骤的调整,且开销最小。ATLAS提供了一个统一框架,将学习到的潜在验证与测试时激活引导相结合,无需额外的LLM解码或推理时过程奖励模型调用即可实现自适应推理控制。在多个数学和编码推理基准上的实验表明,ATLAS始终优于普通解码和固定引导基线,在实现更高准确率的同时大幅减少测试时token使用。这些结果表明,验证器引导的潜在适应提供了一种有效且可扩展的机制,可以在不牺牲解决方案质量的情况下控制推理效率。所有源代码将公开提供。

英文摘要

Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without updating model parameters. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose Adaptive Test-time Latent Steering (ATLAS), a lightweight framework that dynamically controls steering decisions at inference time using a trained, lightweight verifier over the latent states. Given intermediate hidden states, the verifier predicts the quality of ongoing reasoning and adaptively selects which steering action to apply, enabling per-example and per-step adjustment with minimal overhead. ATLAS provides a unified framework for combining learned latent verification with test-time activation steering, enabling adaptive reasoning control without additional LLM decoding or inference-time process reward model calls. Experiments on multiple mathematical and coding reasoning benchmarks show that ATLAS consistently outperforms both vanilla decoding and fixed steering baselines, achieving higher accuracy while substantially reducing test-time token usage. These results demonstrate that verifier-guided latent adaptation provides an effective and scalable mechanism for controlling reasoning efficiency without sacrificing solution quality. All source code will be publicly available.

2510.14836 2026-06-10 cs.CV cs.RO 版本更新

QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

QDepth-VLA:量化深度预测作为视觉-语言-动作模型的辅助监督

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Zhongke Huiling Robot Technology Co.(北京中科慧灵机器人科技有限公司)

AI总结 提出QDepth-VLA框架,通过辅助深度预测任务增强VLA模型的空间感知与推理能力,在仿真和真实任务中提升操作性能。

AI中文摘要

空间感知和推理对于视觉-语言-动作(VLA)模型完成精细操作任务至关重要。然而,现有方法往往缺乏理解和推理精确控制所需的基本3D结构的能力。为解决这一局限,我们提出QDepth-VLA,一种通过辅助深度预测任务增强VLA模型的通用框架。设计了一个专门的深度专家,用于预测从VQ-VAE编码器获得的深度图的量化潜在令牌,使模型能够学习捕捉关键几何线索的深度感知表示。在仿真基准和真实世界任务上的实验结果表明,QDepth-VLA在操作任务上展现出强大的空间推理能力和竞争性能。

英文摘要

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.

2512.14617 2026-06-10 cs.LG cs.AI 版本更新

Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes

离散动作非马尔可夫奖励决策过程中基于模型的强化学习

Alessandro Trapasso, Luca Iocchi, Fabio Patrizi

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出QR-MAX算法,通过奖励机分解马尔可夫转移学习与非马尔可夫奖励处理,首次在离散NMRDP中获得PAC收敛到ε-最优策略的多项式样本复杂度,并扩展至连续状态空间。

Comments Accepted at IJCAI-ECAI 2026. 19 pages, 32 figures, includes appendix

AI中文摘要

许多实际决策问题涉及的任务成功取决于整个系统历史,而非达到具有期望属性的状态。马尔可夫强化学习(RL)方法不适用于此类任务,而基于非马尔可夫奖励决策过程(NMRDP)的RL使智能体能够处理时间依赖任务。长期以来,这种方法缺乏关于(近)最优性和样本效率的形式保证。我们通过QR-MAX解决了这两个问题,这是一种新颖的基于模型的算法,用于离散NMRDP,通过奖励机将马尔可夫转移学习与非马尔可夫奖励处理分解。据我们所知,这是第一个利用这种分解获得PAC收敛到ε-最优策略且具有多项式样本复杂度的离散动作NMRDP的基于模型的RL算法。然后,我们将QR-MAX扩展到连续状态空间,提出Bucket-QR-MAX,一种基于SimHash的离散化器,它保留了相同的分解结构,无需手动网格划分或函数逼近即可实现快速稳定的学习。我们在复杂度递增的环境中将我们的方法与现代最先进的基于模型的RL方法进行了实验比较,显示出样本效率的显著提高和寻找最优策略的鲁棒性增强。

英文摘要

Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
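
To make two ideas from this abstract concrete, the sketch below shows (a) a tiny reward machine, where the reward depends on the order of past events rather than on the current state alone, and (b) a sign-of-random-projection SimHash that maps a continuous state to a discrete bucket. This is an illustrative sketch only, not the QR-MAX or Bucket-QR-MAX code: the class, the function names, the "key before door" task, and the exact hashing scheme are assumptions.

```python
import random
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite-state machine that turns event histories into rewards (illustrative)."""
    initial: str
    # Maps (machine_state, event) -> (next_machine_state, reward).
    delta: dict = field(default_factory=dict)

    def step(self, u, event):
        # Unlisted (state, event) pairs leave the machine where it is, with zero reward.
        return self.delta.get((u, event), (u, 0.0))


def run(rm, events):
    """Replay a sequence of events and return the total reward collected."""
    u, total = rm.initial, 0.0
    for e in events:
        u, r = rm.step(u, e)
        total += r
    return total


def make_simhash(dim, n_bits, seed=0):
    """Return a function mapping a continuous state to an n_bits-bit bucket id."""
    rng = random.Random(seed)
    # One random Gaussian hyperplane per output bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def bucket(x):
        bits = 0
        for plane in planes:
            dot = sum(p * xi for p, xi in zip(plane, x))
            # Each bit records which side of the hyperplane x falls on.
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    return bucket


# Hypothetical task: reward 1 only when "key" is observed before "door".
rm = RewardMachine(
    initial="u0",
    delta={("u0", "key"): ("u1", 0.0), ("u1", "door"): ("u2", 1.0)},
)
bucket = make_simhash(dim=3, n_bits=8, seed=1)
```

Under these assumptions, a tabular learner could index its transition model by `bucket(x)` alone and its reward model by the machine state `u`, which is one way to read the factorization the abstract describes.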

2512.14614 2026-06-10 cs.CV cs.GR Updated version

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling

WorldPlay:面向实时交互式世界建模的长期几何一致性

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo

Affiliations * University of Science and Technology of China

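AI summary Proposes WorldPlay, a streaming video diffusion model that combines a Dual Action Representation, Reconstituted Context Memory, and Context Forcing distillation to achieve real-time interactive world modeling with long-term geometric consistency, generating long 720p video at 24 FPS.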

Comments project page: https://3d-models.hunyuan.tencent.com/world/, demo: https://3d.hunyuan.tencent.com/sceneTo3D, code: https://github.com/Tencent-Hunyuan/HY-WorldPlay

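Details
AI abstract (translated from Chinese)

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling while maintaining long-term geometric consistency, resolving the trade-off between speed and memory in current methods. WorldPlay's power comes from three key ingredients. 1) We use a Dual Action Representation to achieve robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method for memory-aware models. Aligning the memory context between teacher and student preserves the student's ability to use long-range information, preventing error drift while achieving real-time speed. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, compares favorably with existing techniques, and shows strong generalization across diverse scenes. The project page and online demo are available at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.

English abstract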

This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key ingredients. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.