arXivDaily: daily arXiv digest, updated Monday to Friday
2508.06974 2026-05-19 cs.CL

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

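Zhijun Tu, Jian Li, Yuanyuan Xi, Siqi Liu, Chuanjian Liu, Hanting Chen, Jie Hu, Yunhe Wang

Affiliations: Huawei Technologies Co., Ltd

AI summary: This paper proposes a consistent progressive training method that gradually converts full-precision weights into binarized weights to improve the performance of 1-bit large language models, and uses binary-aware initialization and dual-scaling compensation to make training easier.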

Comments 15 pages, 7 figures

Abstract

1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the full-precision weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
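The idea of moving smoothly from full-precision to binarized weights can be illustrated with a small sketch. The abstract does not give the schedule or the exact blending rule, so the linear interpolation below, the single scale `alpha = mean(|w|)`, and the names `binarize` and `progressive_weights` are assumptions for illustration only, not the authors' method.

```python
def binarize(weights):
    """Sign binarization with one shared scale alpha = mean(|w|) (assumed form)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def progressive_weights(weights, lam):
    """Blend full-precision and binarized weights.

    lam = 0 returns the original weights, lam = 1 returns the fully
    binarized weights; a training loop would raise lam step by step.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    binary = binarize(weights)
    return [(1.0 - lam) * w + lam * b for w, b in zip(weights, binary)]


if __name__ == "__main__":
    w = [0.5, -1.5, 1.0, -1.0]
    for lam in (0.0, 0.5, 1.0):
        print(lam, progressive_weights(w, lam))
```

Because the blend is applied to the weights themselves, the same blended values can be used in both the forward and the backward pass, which is one simple way to read "consistent" in the abstract.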

2508.06038 2026-05-19 cs.CV cs.AI

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

Affiliations: LUMIA Lab; School of Artificial Intelligence; Shanghai Jiao Tong University; Shanghai Innovation Institute; Noah's Ark Lab; Huawei Technologies Ltd.; School of Computer Science

AI summary: This paper proposes a frequency-domain visual token compression strategy that uses the Fourier transform to cut computational overhead and improve efficiency while preserving semantic fidelity; experiments show strong results on both image and video tasks.

Abstract

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n^2 \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
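A minimal sketch of parameter-free low-pass token compression follows. It swaps in a naive discrete cosine transform (the transform used by JPEG) over a 1-D token sequence instead of the FFT the abstract mentions, purely to stay dependency-free; the function names, the 1-D layout, and the `keep / n` rescaling are assumptions, not the paper's implementation.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a real sequence (naive O(n^2) version)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        for k in range(n)
    ]


def idct2(coeffs):
    """Inverse of dct2 above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + (2.0 / n)
        * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(1, n))
        for i in range(n)
    ]


def compress_tokens(tokens, keep):
    """Reduce n tokens to `keep` tokens by dropping high-frequency coefficients.

    tokens: list of n feature vectors of equal length.
    Each feature channel is transformed along the token axis, truncated to
    its `keep` lowest frequencies, rescaled, and transformed back.
    """
    n, dim = len(tokens), len(tokens[0])
    if not 1 <= keep <= n:
        raise ValueError("keep must be between 1 and the number of tokens")
    channels = []
    for d in range(dim):
        low = dct2([tok[d] for tok in tokens])[:keep]
        channels.append(idct2([c * keep / n for c in low]))
    return [[channels[d][i] for d in range(dim)] for i in range(keep)]
```

No learned weights are involved, which is the sense in which such a module is parameter-free; a production version would use a fast transform rather than the quadratic loops shown here.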

2507.18406 2026-05-19 cs.CL cs.DB cs.DL cs.IR

Factual Inconsistencies in Multilingual Wikipedia Tables

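Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

Affiliations: CNR ISTC; Fraunhofer Institute for Applied Information Technology FIT; Tallinn University of Technology; EURECOM; Technical University of Munich; University of Amsterdam

AI summary: This study examines cross-lingual inconsistencies in the structured content of multilingual Wikipedia, with a focus on tabular data. It develops a methodology to collect, align, and analyze tables from multilingual Wikipedia articles, defines categories of inconsistency, and applies quantitative and qualitative metrics to assess multilingual alignment, with implications for fact verification, multilingual knowledge interaction, and the design of reliable AI systems.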

Comments 11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025

Abstract

Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.

2507.17798 2026-05-19 cs.LG

Wasserstein GAN-Based Precipitation Downscaling with Optimal Transport for Enhancing Perceptual Realism

Kenta Shiraishi, Yuka Muto, Atsushi Okazaki, Shunji Kotsuki

Affiliations: Graduate School of Science and Engineering, Chiba University; Center for Environmental Remote Sensing, Chiba University; Institute for Advanced Academic Research, Chiba University; Research Institute of Disaster Medicine, Chiba University

AI summary: This paper proposes precipitation downscaling with a Wasserstein GAN and an optimal transport cost to improve the perceptual realism of precipitation predictions. Although the WGAN scores slightly lower on conventional evaluation metrics, the precipitation fields it generates are visually more realistic, and its critic scores help identify unrealistic outputs and potential artifacts in the reference data.

Journal ref Progress in Earth and Planetary Science, 13, 29, 2026

Abstract

High-resolution (HR) precipitation prediction is essential for reducing damage from stationary and localized heavy rainfall; however, HR precipitation forecasting with process-driven numerical weather prediction models remains challenging. This study proposes using a Wasserstein Generative Adversarial Network (WGAN) to perform precipitation downscaling with an optimal transport cost. In contrast to a conventional neural network trained with mean squared error, the WGAN generated visually realistic precipitation fields with fine-scale structures, even though it exhibited slightly lower performance on conventional evaluation metrics. The learned critic of the WGAN correlated well with human perceptual realism. Case-based analysis revealed that large discrepancies in critic scores can help identify both unrealistic WGAN outputs and potential artifacts in the reference data. These findings suggest that the WGAN framework not only improves perceptual realism in precipitation downscaling but also offers a new perspective for evaluating and quality-controlling precipitation datasets.

2507.05482 2026-05-19 cs.LG stat.ML

Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions

Van Khoa Nguyen, Lionel Blondé, Alexandros Kalousis

Affiliations: Department of Computer Science, University of Geneva

AI summary: This paper proposes Stein Diffusion Guidance, a training-free posterior correction method for sampling beyond high-density regions. The method combines stochastic optimal control with Stein variational inference and, through a new theoretical bound and a running cost functional, enables effective guidance in low-density regions.

Comments Revised version accepted to the ICML 2026 main track; prior version accepted to two ICLR 2026 workshops: ReALM-GEN and DeLTa

Abstract

Training-free diffusion guidance offers a flexible framework for leveraging off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie's formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, enables principled posterior sampling but remains computationally prohibitive for efficient inference. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a new theoretical bound on the SOC value function, revealing the necessity of correcting approximate posteriors to reflect true diffusion dynamics. Building on Stein variational inference, SDG computes the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By integrating a principled Stein correction mechanism along with a novel running cost functional, SDG enables effective guidance in low-density regions. Our experiments on diverse image-guidance tasks and on challenging small-ligand sampling for protein docking suggest that SDG consistently outperforms standard training-free guidance methods and highlights its potential for broader posterior sampling problems beyond high-density regimes.

2506.16042 2026-05-19 cs.AI cs.LG cs.OS

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

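Reyna Abhyankar, Qi Qi, Yiying Zhang

Affiliations: OpenAI Anthropic Google DeepMind ByteDance Agent S2 GTA1 Lei Jedi

AI summary: This paper studies the temporal performance of computer-use agents on the OSWorld benchmark, finds that large model calls are the main source of latency, and builds OSWorld Human, a dataset with human trajectories; the evaluation shows that even the best agents take more steps than necessary.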

Abstract

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and that as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of the task. We then construct OSWorld Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld Human and find that even the best agents take 2.7-4.3x more steps than necessary.

2506.15588 2026-05-19 cs.LG

Memory-Efficient Differentially Private Training with Gradient Random Projection

Alex Mulrooney, Devansh Gupta, James Flemings, Huanyu Zhang, Murali Annavaram, Meisam Razaviyayn, Xinwei Zhang

Affiliations: University of Delaware; University of Southern California; Meta; Amazon

AI summary: This paper proposes DP-GRAPE, which replaces SVD-based subspaces with random Gaussian matrices to reduce memory usage while keeping utility on par with first-order DP methods; it removes the need for costly SVD computations and improves memory efficiency and utility.

Abstract

Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (as in GaLore (Zhao et al., 2024)) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints. Our code is available at https://github.com/alexmul1114/DP_GRAPE.
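The first two components listed in the abstract (a random Gaussian projection, then privatization in the projected space) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code from the linked repository: it flattens gradients to plain vectors, uses one fixed projection, omits projection during backpropagation and all privacy accounting, and every function name here is hypothetical.

```python
import math
import random


def gaussian_matrix(rows, cols, rng):
    """Random projection with i.i.d. N(0, 1/rows) entries (assumed scaling)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]


def project(matrix, vec):
    """Matrix-vector product: maps a full gradient to the low-dimensional space."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def clip(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in vec]


def private_projected_gradient(per_sample_grads, proj, clip_norm, noise_multiplier, rng):
    """Project each per-sample gradient, clip it, sum, add Gaussian noise, average."""
    total = [0.0] * len(proj)
    for grad in per_sample_grads:
        low = clip(project(proj, grad), clip_norm)
        total = [t + l for t, l in zip(total, low)]
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_sample_grads) for x in noisy]
```

The memory argument is visible in the shapes: clipping and noise are applied to vectors of length `rows` rather than to full-size per-sample gradients.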

2506.08244 2026-05-19 cs.LG cs.AI stat.ML

Algebraic Priors for Approximately Equivariant Networks

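Riccardo Ali, Pietro Liò, Jamie Vicary

Affiliations: University of Cambridge

AI summary: This paper proposes a parameter-free algebraic method that uses group representation theory to build a prior for equivariant networks; experiments show that it matches or outperforms specialized models in several cases, including those designed for infinite groups.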

Abstract

Equivariant neural networks incorporate symmetries through group actions, embedding them as an inductive bias to improve performance. Existing methods learn an equivariant action on the latent space, or design architectures that are equivariant by construction. These approaches often deliver strong empirical results but can involve architecture-specific constraints, large parameter counts, and high computational cost. We challenge the paradigm of complex equivariant architectures with a parameter-free approach grounded in group representation theory. We prove that for an equivariant encoder over a finite group, the latent space must almost surely contain one copy of its regular representation for each linearly independent data orbit, which we explore with a number of empirical studies. Leveraging this foundational algebraic insight, we impose the group's regular representation as an inductive bias via an auxiliary loss, adding no learnable parameters. Our extensive evaluation shows that this method matches or outperforms specialized models in several cases, even those for infinite groups. We further validate our choice of the regular representation through an ablation study, showing it consistently outperforms defining and trivial group representation baselines.
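To make the regular representation concrete, the sketch below builds it for the cyclic group Z_n as permutation matrices and defines one possible auxiliary penalty, the squared error between encoding a transformed input and applying the representation to the encoding. The abstract does not state the form of the loss, so `equivariance_penalty`, the cyclic-shift action on inputs, and the choice of a latent space of dimension n are illustrative assumptions.

```python
def regular_rep(n, g):
    """Regular representation of g in Z_n: the matrix sending basis e_h to e_{g+h}."""
    return [[1 if row == (g + col) % n else 0 for col in range(n)] for row in range(n)]


def matmul(a, b):
    """Plain matrix product for small square matrices."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def matvec(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def equivariance_penalty(encode, x, g):
    """Squared error between encode(g . x) and rho(g) encode(x).

    The group acts on the input by cyclic shift; the latent space is
    assumed to have dimension n = len(x) and carry one regular copy.
    """
    n = len(x)
    shifted = [x[(i - g) % n] for i in range(n)]
    lhs = encode(shifted)
    rhs = matvec(regular_rep(n, g), encode(x))
    return sum((l - r) ** 2 for l, r in zip(lhs, rhs))
```

The penalty is zero for an encoder that commutes with the shift (the identity map is the simplest case) and positive otherwise, and it introduces no learnable parameters of its own.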

2505.24438 2026-05-19 cs.LG

Weisfeiler and Leman Follow the Arrow of Time: Expressive Power of Message Passing in Temporal Event Graphs

Franziska Heeg, Jonas Sauer, Petra Mutzel, Ingo Scholtes

Affiliations: Chair of Machine Learning for Complex Networks; Center for AI and Data Science (CAIDAS); University of Würzburg; Karlsruhe Institute of Technology; University of Karlsruhe; Institute for Computer Science 1; University of Bonn

AI summary: This study examines the expressive power of message passing in temporal event graphs and proposes a generalization of the Weisfeiler-Leman algorithm, based on consistent event graph isomorphism, for distinguishing non-isomorphic temporal graphs.

Abstract

An important characteristic of temporal graphs is how the directed arrow of time influences their causal topology, i.e., which nodes can possibly influence each other causally via time-respecting paths. The resulting patterns are often neglected by temporal graph neural networks (TGNNs). To formally analyze the expressive power of TGNNs, we lack a generalization of graph isomorphism to temporal graphs that fully captures their causal topology. Addressing this gap, we introduce the notion of consistent event graph isomorphism, which utilizes a time-unfolded representation of time-respecting paths in temporal graphs. We compare this definition with existing notions of temporal graph isomorphisms. We illustrate and highlight the advantages of our approach and develop a temporal generalization of the Weisfeiler-Leman algorithm to heuristically distinguish non-isomorphic temporal graphs. Building on this theoretical foundation, we derive a novel message passing scheme for temporal graph neural networks that operates on the event graph representation of temporal graphs. An experimental evaluation shows that our approach performs well in a temporal graph classification experiment.
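For readers unfamiliar with the baseline being generalized, here is a sketch of the classical 1-dimensional Weisfeiler-Leman color refinement on static, undirected graphs. It is not the temporal variant the paper develops; the adjacency-dictionary input format and the function names are assumptions.

```python
from collections import Counter


def wl_colors(adjacency, rounds):
    """Run `rounds` of 1-WL color refinement.

    adjacency: dict mapping each node to a list of its neighbors.
    Colors are kept as nested tuples so they stay comparable across graphs.
    """
    colors = {v: 0 for v in adjacency}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
    return colors


def wl_histogram(adjacency, rounds):
    """Multiset of final colors; different histograms prove non-isomorphism."""
    return Counter(wl_colors(adjacency, rounds).values())


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1]}
    triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    print(wl_histogram(path, 2) != wl_histogram(triangle, 2))
```

The test is one-sided, which is why the abstract calls it heuristic: equal histograms do not prove isomorphism. A 6-cycle and two disjoint triangles, for example, receive identical colors in every round.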

2505.21893 2026-05-19 cs.LG cs.AI

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

Affiliations: Shanghai Science and Intelligence Institute, Shanghai, China; Fudan University, Shanghai, China; Australian Institute for Machine Learning, The University of Adelaide

AI summary: This study proposes SIPO, a framework that uses timestep-aware importance reweighting and gradient stabilization to address training instability and off-policy bias in diffusion model alignment, improving both alignment quality and stability.

Comments This version supplements with more detailed content on reasoning and proof, additional experimental results, and ablation studies

Abstract

Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variances at various timesteps and high parameter sensitivities, and off-policy bias arising from the discrepancy between the optimization data and the policy model's distribution. Our first contribution is a systematic analysis of diffusion trajectories across different timesteps, identifying that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved Preference Optimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, i.e., DPO-C&M, is introduced to stabilize training by clipping and masking uninformative timesteps. This is followed by a timestep-aware importance-reweighting paradigm to mitigate off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models, including the image generation models SD1.5 and SDXL and the video generation models CogVideoX-2B/5B and Wan2.1-1.3B, demonstrate that SIPO consistently promotes stabilized training and outperforms existing alignment methods that rely on meticulous parameter tuning. Overall, these results suggest the importance of timestep-aware alignment and provide valuable guidelines for improved preference optimization in aligning diffusion models.

2505.20218 2026-05-19 cs.LG

Fine-grained List-wise Alignment for Generative Medication Recommendation

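Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng

Affiliations: University of Science and Technology of China

AI summary: This paper proposes FLAME, a fine-grained list-wise alignment framework in which a large language model generates medication lists, improving the accuracy and safety of medication recommendation while accounting for drug-drug interactions and potential adverse effects.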

Comments NeurIPS 2025 Spotlight

Abstract

Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.
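Potential-based reward shaping adds the term gamma * Phi(s') - Phi(s) to each step's reward, where Phi scores a state. The sketch below applies that standard formula to a drug list, with a hypothetical potential equal to the negative number of interacting drug pairs. The abstract does not specify the potential FLAME uses, so the potential, the function names, and the placeholder drug labels are assumptions.

```python
from itertools import combinations


def ddi_count(drugs, interactions):
    """Number of unordered drug pairs in the list that are known to interact."""
    return sum(
        1 for pair in combinations(sorted(drugs), 2) if frozenset(pair) in interactions
    )


def potential(drugs, interactions, penalty=1.0):
    """Hypothetical potential: fewer interacting pairs means higher potential."""
    return -penalty * ddi_count(drugs, interactions)


def shaped_reward(base_reward, before, after, interactions, gamma=1.0):
    """Base reward plus the shaping term gamma * Phi(after) - Phi(before)."""
    return (
        base_reward
        + gamma * potential(after, interactions)
        - potential(before, interactions)
    )


if __name__ == "__main__":
    known = {frozenset({"drug_a", "drug_b"})}  # placeholder labels, not real drugs
    print(shaped_reward(0.0, {"drug_a"}, {"drug_a", "drug_b"}, known))
```

With gamma = 1 the shaping terms telescope over an episode, so the total added reward depends only on the first and last drug lists; each individual add or remove step still receives its own signal.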

2505.18991 2026-05-19 cs.CV

Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

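Hancong Jin, Zihan Cao, Liang-jian Deng, Jingjing Li

Affiliations: University of Electronic Science and Technology of China

AI summary: This paper proposes KSDiff, a framework that combines a low-rank core tensor generator with a unified factor generator, coordinated by a structure-aware multi-head attention mechanism, to generate convolutional kernels enriched with global context; this improves pansharpening quality and accelerates inference, and experiments show it outperforms existing methods in both performance and efficiency.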

Comments CVPR 2026 Findings

Abstract

Pansharpening seeks to fuse high-resolution panchromatic (PAN) and low-resolution multispectral (LRMS) images into a single image with both fine spatial and rich spectral detail. Despite progress in deep learning-based approaches, existing methods often fail to capture global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from heavy inference latency. We introduce KSDiff, a fast kernel-space diffusion framework that generates convolutional kernels enriched with global context to enhance pansharpening quality and accelerate inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, facilitating integration into existing pansharpening architectures. Experiments show that KSDiff achieves superior performance compared to recent promising methods, with over $500\times$ faster inference than diffusion-based pansharpening baselines. Ablation studies, visualizations, and further evaluations substantiate the effectiveness of our approach. Code will be released upon possible acceptance.

2505.17138 2026-05-19 cs.LG cs.AI

RAP: Runtime Adaptive Pruning for LLM Inference

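Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

Affiliations: Faculty of Science and Technology, University of Macau, Macau, China; School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China

AI summary: This paper proposes RAP, an elastic pruning framework driven by reinforcement learning that dynamically adjusts its compression strategy to runtime memory variations and heterogeneous KV-cache demands; the authors describe it as the first approach to jointly consider model weights and the KV-cache during inference.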

Abstract

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and KV-cache on the fly.
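The shifting balance between weights and KV-cache can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard size formula for a key/value cache (two tensors per layer, one entry per head, token, and sequence); the model dimensions are a hypothetical configuration chosen for round numbers and are not taken from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Memory for keys and values across all layers, tokens, and sequences."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


def weight_bytes(num_params, bytes_per_value):
    """Memory for the model parameters at a given precision."""
    return num_params * bytes_per_value


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model served in 16-bit precision.
    weights = weight_bytes(7_000_000_000, 2)
    for batch in (1, 8):
        cache = kv_cache_bytes(32, 32, 128, 4096, batch, 2)
        print(batch, cache / weights)
```

Under these assumed numbers the cache grows linearly with batch size and context length while the weights stay fixed, so a fixed compression heuristic cannot be right for every request mix; that is the runtime variation a scheme such as RAP is meant to track.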

2505.16278 2026-05-19 cs.CV cs.AI cs.RO

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

Affiliations: Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University; Institute of Trustworthy Embodied AI, Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI; AnyScale AI Project

AI summary: This paper proposes DriveMoE, an end-to-end autonomous driving framework built on a Mixture-of-Experts architecture. Its scene-specialized vision MoE and skill-specialized action MoE handle complex driving scenarios effectively, demonstrating the value of combining vision and action MoE in autonomous driving tasks.

Comments Accepted by CVPR 2026, Project Page: https://thinklab-sjtu.github.io/DriveMoE/

Abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add the Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from the mode averaging seen in existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.
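A router of the kind described, one that picks a few cameras or experts per input, is commonly built as softmax scoring followed by top-k selection. The sketch below shows that generic pattern only; the abstract does not say how DriveMoE's routers are parameterized, and the logits here stand in for the output of a learned network that is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(logits, k):
    """Select the k highest-scoring experts and renormalize their weights.

    Returns a dict mapping expert (or camera) index to mixing weight.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


if __name__ == "__main__":
    # Hypothetical router scores for four camera views.
    print(route_top_k([2.0, 0.0, 1.0, -1.0], k=2))
```

Only the selected entries are processed downstream, which is how sparse routing avoids spending compute on every view or every behavior expert at each step.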

2503.14346 2026-05-19 cs.CV

3D Densification for Multi-Map Monocular VSLAM in Endoscopy

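3D Densification for Multi-Map Monocular Visual SLAM in Endoscopy

X. Anadón, Javier Rodríguez-Puigvert, J. M. M. Montiel

Affiliations * Universidad de Zaragoza

AI summary This paper proposes a method that removes outliers and densifies the maps of multi-map monocular visual SLAM in endoscopy, improving the 3D representation of the environment and giving more accurate 3D map reconstruction for clinical applications.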

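Details
AI abstract (translated from Chinese)

Multi-map sparse monocular visual simultaneous localization and mapping (SLAM), applied to monocular endoscopic sequences, has been shown to robustly recover tracking after the frequent losses in endoscopy caused by motion blur, temporary occlusion, tool interaction, or water jets. Sparse multi-maps are adequate for robust camera localization, but they represent the environment poorly: they are noisy, contain a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, have a density that is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state-of-the-art sparse multi-map system for endoscopy. The up-to-scale dense depth predictions of the LightDepth neural network are aligned with the sparse CudaSIFT submaps using the robust LMedS estimator. Our system mitigates the inherent scale ambiguity of monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. On the C3VD phantom colon dataset, we provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time. We also report qualitative results on real colonoscopies from the Endomapper dataset.

English abstract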

Multi-map sparse monocular visual Simultaneous Localization and Mapping (SLAM) applied to monocular endoscopic sequences has proven effective at robustly recovering tracking after the frequent losses in endoscopy due to motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization; however, they are very poor for environment representation: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, with an unacceptably low density for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state of the art in sparse multi-map SLAM for endoscopy. The up-to-scale dense depth predictions of the LightDepth neural network (NN) are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, on the C3VD phantom colon dataset. We report qualitative results on real colonoscopies from the Endomapper dataset.
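
A minimal sketch of least-median-of-squares (LMedS) scale alignment, to show how a single scale factor can be fitted to up-to-scale depth predictions in the presence of outliers. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the absolute `tolerance`, and the sample depths are hypothetical, and because the model has one parameter the sketch simply tries every one-point hypothesis instead of sampling random subsets.

```python
from statistics import median

def lmeds_scale(sparse_depths, predicted_depths):
    """Return the scale s (and its cost) minimizing median((z - s * d)^2)."""
    pairs = [(z, d) for z, d in zip(sparse_depths, predicted_depths) if d > 0]
    best_scale, best_cost = None, float("inf")
    for z, d in pairs:
        scale = z / d  # one-point hypothesis for the scale
        cost = median([(zi - scale * di) ** 2 for zi, di in pairs])
        if cost < best_cost:
            best_scale, best_cost = scale, cost
    return best_scale, best_cost

def inlier_mask(sparse_depths, predicted_depths, scale, tolerance):
    """Flag sparse points whose residual under the fitted scale is small."""
    return [abs(z - scale * d) <= tolerance
            for z, d in zip(sparse_depths, predicted_depths)]

# Assumed toy data: true scale 2.0, with two grossly wrong sparse depths.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
sparse = [2.0, 20.0, 6.0, 8.0, 10.0, 1.0, 14.0]
scale, cost = lmeds_scale(sparse, predicted)
mask = inlier_mask(sparse, predicted, scale, tolerance=0.5)
```

Because the cost is a median, not a mean, the two corrupted points do not pull the estimate away from the scale supported by the majority, and the same residuals can then be used to discard them.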

2501.01046 2026-05-19 cs.CL

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

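SEDD: A Scalable and Efficient GPU-Based Method for Dataset Deduplication

Youngjun Son, Chaewon Kim, Jaejin Lee

Affiliations * Graduate School of Data Science; Department of Computer Science; Seoul National University

AI summary This paper proposes SEDD, an efficient GPU-based deduplication framework. By introducing a computationally efficient, partially reusable hash function, highly optimized GPU kernels, and a hardware-aware automatic parameter selection mechanism, it significantly reduces communication bottlenecks and improves deduplication efficiency while maintaining high deduplication fidelity.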

Comments 13 pages, 7 figures

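Details
AI abstract (translated from Chinese)

Dataset deduplication is widely recognized as a crucial preprocessing step that improves data quality and the performance of large language models. A commonly used method is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been introduced to handle large-scale corpora; however, they remain inefficient because of the high communication overhead of physical data shuffling and low GPU utilization. In this paper, we propose SEDD, a high-performance GPU-accelerated deduplication framework optimized for distributed cluster environments. SEDD introduces a computationally efficient, partially reusable hash function, along with highly optimized GPU kernels and a hardware-aware automatic parameter selection mechanism. By replacing traditional data shuffling with a streaming-based approach, SEDD significantly relieves communication bottlenecks. When processing 30 million documents on a node with four GPUs, our framework is up to 158$\times$ faster than the CPU-based deduplication tool in SlimPajama and up to 7.8$\times$ faster than the GPU-based tool in NVIDIA NeMo Curator. Notably, SEDD greatly accelerates the previously time-consuming MinHash signature generation phase, with speedups of up to 375$\times$ over the CPU baseline. Despite these efficiency gains, SEDD maintains high deduplication fidelity: its duplicate document sets reach Jaccard similarities above 0.95 with those identified by the standard MinHash algorithm. In large-scale experiments, deduplicating 1.2 trillion tokens takes just 3 hours on an 8-node, 32-GPU V100 cluster. The code is publicly available on GitHub (https://github.com/mcrl/SEDD).

English abstract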

Dataset deduplication is widely recognized as a crucial preprocessing step that enhances data quality and improves the performance of large language models. A commonly used method for this process is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been introduced to handle large-scale corpora; however, they remain suboptimal due to high communication overhead from physical data shuffling and underutilization of GPU resources. In this paper, we propose SEDD, a high-performance GPU-accelerated deduplication framework optimized for distributed cluster environments. SEDD introduces a computationally efficient, partially reusable hash function, alongside highly optimized GPU kernels and a hardware-aware automatic parameter selection mechanism. By replacing traditional data shuffling with a streaming-based approach, SEDD significantly mitigates communication bottlenecks. Our framework outperforms the CPU-based deduplication tool in SlimPajama by up to 158$\times$ and the GPU-based tool in NVIDIA NeMo Curator by up to 7.8$\times$ when processing 30 million documents on a node with four GPUs. Notably, SEDD dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speedups of up to 375$\times$ over the CPU baseline. Despite these gains in efficiency, SEDD maintains high deduplication fidelity, with duplicate document sets achieving Jaccard similarities of over 0.95 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 3 hours on an 8-node 32-GPU V100 cluster. The related code is publicly available on GitHub (https://github.com/mcrl/SEDD).
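
For readers unfamiliar with MinHash LSH, here is a small CPU-only sketch of the standard algorithm the abstract refers to. It does not reproduce SEDD's reusable hash function, GPU kernels, or parameter selection; the function names, the salted BLAKE2b hashing, the shingle size, and the band count are illustrative assumptions.

```python
import hashlib

def shingles(text, n=3):
    """Split text into a set of word n-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=128):
    """For each salted hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # each salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for item in items))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

def lsh_band_keys(signature, bands=32):
    """Documents that share any (band index, band values) key become candidate pairs."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
estimate = estimate_jaccard(sig_a, sig_b)
exact = exact_jaccard(shingles(doc_a), shingles(doc_b))
```

The signature loop is the part the abstract describes as time-consuming: it costs one hash per item per hash function, which is why it is a natural target for GPU acceleration.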

2410.13181 2026-05-19 cs.CL

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

AdaSwitch: An Adaptive Switching Method between Small and Large Agents for Effective Cloud-Local Collaborative Learning

Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin

Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China; School of Intelligence Science and Technology, Peking University; East China Normal University; Chinese Academy of Sciences; Baidu Inc; Beijing Institute of Technology; University of Birmingham

AI summary This paper proposes a new LLM usage paradigm that adaptively switches between cloud-based and locally deployed LLMs to improve task completion performance and efficiency. The local agent handles simple reasoning steps and the cloud agent handles complex ones. Experiments show that the method effectively improves the local agent's performance on multiple benchmarks.

Comments EMNLP 2024 Main Conference

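Details
AI abstract (translated from Chinese)

Large language models (LLMs) have advanced remarkably in recent years. Users face a choice: use cloud-based LLMs for generation quality, or deploy local LLMs for lower computational cost. The former is typically costly and inefficient, while the latter usually cannot deliver satisfactory performance on reasoning steps that require deliberate thought. In this work, we propose a new LLM usage paradigm that enables large cloud-based LLMs and smaller locally deployed LLMs to work together. Our framework has two main modules: a local agent, instantiated with a relatively small LLM, that handles simpler reasoning steps, and a cloud agent, equipped with a larger LLM, that handles more complex reasoning steps. This collaboration is driven by an adaptive mechanism in which the local agent introspectively identifies errors and proactively asks the cloud agent for help, combining the strengths of locally deployed and cloud-based LLMs and significantly improving task completion performance and efficiency. We evaluate AdaSwitch on 7 benchmarks covering mathematical reasoning and complex question answering, using different types of LLMs to instantiate the local and cloud agents. The experimental results show that AdaSwitch effectively improves the performance of the local agent and sometimes achieves results competitive with the cloud agent at much lower computational overhead.

English abstract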

Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller locally deployed LLMs. Our framework comprises two primary modules: the local agent, instantiated with a relatively smaller LLM, which handles less complex reasoning steps, and the cloud agent, equipped with a larger LLM, which manages more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism in which the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby integrating the strengths of both locally deployed and cloud-based LLMs and significantly enhancing task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning to complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while incurring much less computational overhead.

2401.09512 2026-05-19 cs.SD eess.AS

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

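MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

Affiliations * Fraunhofer AISEC; Resemble AI; Wrocław University of Science and Technology; Thorsten-Voice

AI summary This paper presents version 10 of the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), which covers 175 text-to-speech (TTS) models and a total of 1002.9 hours of synthetic speech in 54 languages. It is intended for training and evaluating audio deepfake detection models, and the paper reports strong performance across multiple datasets.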

Comments IJCNN 2024

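Details
AI abstract (translated from Chinese)

This paper presents version 10 of the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), a synthetic audio dataset for training and evaluating audio deepfake detection models. The dataset covers 175 text-to-speech (TTS) models and a total of 1002.9 hours of synthetic speech in 54 different languages. To evaluate the dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that, as a training resource, it outperforms comparable datasets such as InTheWild and FakeOrReal. Moreover, compared with the well-known ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperform each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible through an interactive web server, we aim to democratize anti-spoofing technology, making it accessible beyond specialists, and to contribute to global efforts against audio spoofing and deepfakes.

English abstract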

This paper presents the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), version 10: a dataset of synthetic audio to train and evaluate audio deepfake detection models. It features 175 Text-to-Speech (TTS) models, comprising a total of 1002.9 hours of synthetic voice in 54 different languages. To evaluate this dataset, we train three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance to comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing the dataset and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.

2307.08643 2026-05-19 cs.LG stat.ML

Corruptions of Supervised Learning Problems: Typology and Mitigations

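Corruptions of Supervised Learning Problems: Typology and Mitigations

Laura Iacovissi, Nan Lu, Robert C. Williamson

Affiliations * Tübingen AI Center, University of Tübingen

AI summary This paper develops a general theory of corruption that analyzes changes to the underlying probability distributions via Markov kernels, unifies different corruption models, and investigates mitigations for the various corruption types.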

Comments 73 pages. To be published in Journal of Machine Learning Research 27 (2026) 1-73

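Details
AI abstract (translated from Chinese)

Corruption is widespread in data collection. Despite extensive research, the existing literature focuses mainly on specific settings and learning scenarios and lacks a unified view of how corruption is modeled and mitigated. This paper develops a general theory of corruption that covers all modifications to a supervised learning problem, including changes in model class and loss. By analyzing changes to the underlying probability distributions, our approach opens three new opportunities. First, it yields a novel, provably exhaustive corruption framework that distinguishes among different corruption types. Second, it supports a systematic analysis of the effect of corruption on learning tasks by comparing Bayes risks in the clean and corrupted scenarios. Third, building on these results, we study mitigations for the various corruption types. We extend existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the need to generalize the classical corruption-corrected learning framework to a paradigm with weaker requirements so that it covers more corruption types. We provide such a paradigm, along with loss-correction formulas for the attribute and joint corruption cases.

English abstract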

Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelization and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of corruption's consequences on learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the necessity to generalize this classical corruption-corrected learning framework to a new paradigm with weaker requirements to encompass more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.
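
To make "loss correction for label corruption" concrete, the sketch below implements the classical backward correction for class-conditional label noise in the binary case. Treating this as the kind of existing method the abstract extends is an assumption, and the transition matrix, loss values, and function names are hypothetical. With `transition[i][j]` the probability that clean label `i` is observed as label `j`, the corrected loss vector is the inverse transition matrix applied to the clean loss vector, so its expectation over noisy labels recovers the clean loss.

```python
def invert_2x2(matrix):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("transition matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

def backward_corrected_loss(loss_per_label, transition):
    """Corrected loss vector: inverse(transition) applied to the clean loss vector."""
    inverse = invert_2x2(transition)
    return [sum(inverse[i][j] * loss_per_label[j] for j in range(2)) for i in range(2)]

def expected_corrected_loss(corrected, transition, clean_label):
    """Average the corrected loss over the noisy label drawn for a given clean label."""
    return sum(transition[clean_label][j] * corrected[j] for j in range(2))

# Assumed example: 20% of class 0 and 30% of class 1 labels are flipped.
transition = [[0.8, 0.2], [0.3, 0.7]]
clean_loss = [0.1, 2.3]  # loss of one prediction if the true label is 0 or 1
corrected = backward_corrected_loss(clean_loss, transition)
```

The corrected values can be negative for individual labels; only their expectation under the noise model matches the clean loss, which is the sense in which the correction is unbiased.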

2605.18150 2026-05-19 cs.AI

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

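Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, Yi Zhang

Affiliations * Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University; School of Cyber Science and Engineering, Sichuan University; Lee Kong Chian School of Medicine, Nanyang Technological University

AI summary This paper studies how erased concepts can be recovered from pretrained models under black-box constraints using a multi-agent framework. It proposes a training-free, surrogate-guided method that achieves controllable awakening by steering the noisy state, exposing the limitations of current concept erasure methods.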

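Details
AI abstract (translated from Chinese)

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress target concepts without fully eliminating them, leaving models vulnerable to awakening attacks. Existing approaches rely mainly on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this paper, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model depends increasingly on the evolving noisy state and less on the text condition, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments show that ConceptAgent achieves accurate and controllable awakening of erased concepts without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and offer new insight into the dynamic nature of semantic control in DMs.

English abstract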

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

2605.18147 2026-05-19 cs.LG

Foundation Models for Credit Risk Prediction: A Game Changer?

Foundation Models for Credit Risk Prediction: A Game Changer?

Bart Baesens, Andreas Goethals, Stefan Lessmann, Simon De Vos, Cristián Bravo, David Martens, Victor Medina-Olivares, Christophe Mues, Maria Oskarsdóttir, Seppe vanden Broucke, Tim Verdonck, Wouter Verbeke

Affiliations * Faculty of Economics and Business, KU Leuven, Belgium; School of Business and Economics, Humboldt University of Berlin, Germany; Department of Statistical and Actuarial Sciences, Western University, Canada; Department of Engineering Management, University of Antwerp, Belgium; Business School, University of Edinburgh, United Kingdom; Business School, University of Southampton, United Kingdom; School of Mathematical Sciences, University of Southampton, United Kingdom; Department of Business Informatics and Operations Management, Ghent University, Belgium; Department of Mathematics, University of Antwerp, Belgium; Department of Mathematics, KU Leuven, Belgium

AI summary This paper studies the use of foundation models for credit risk prediction, examines their ability to improve predictive performance in small-data settings, and benchmarks them against a range of methods, finding that they generally perform best on PD and LGD modeling tasks.

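Details
AI abstract (translated from Chinese)

Predictive models play a pivotal role in credit risk management, guiding critical decisions through accurate estimates of default probabilities and losses. Extensive research has introduced new modeling techniques, and large-scale benchmarking studies have consolidated the state of the art. Today, gradient-boosting models paired with SHAP explainers have become a quasi-standard, yet continuous improvement of risk models remains a top priority. Meanwhile, rapid advances in AI, most notably large language models, have disrupted predictive modeling paradigms. Foundation models, pretrained on extensive datasets from diverse domains, achieve remarkable performance by leveraging prior knowledge. Although they are prevalent in natural language processing and computer vision, foundation models for tabular data have only recently emerged. We conjecture that pretraining on out-of-domain data is particularly beneficial in small-data settings, such as SME lending or specialized corporate portfolios, and may help address longstanding challenges including low-default portfolios and class imbalance. This paper benchmarks recently proposed tabular foundation models against a broad set of competitors, including established and advanced machine learning techniques, on two core tasks: probability of default (PD) and loss given default (LGD) modeling. Our evaluation covers a variety of datasets, performance indicators, and experimental conditions. We find that tabular foundation models generally perform best across datasets and tasks. Moreover, they offer significant improvements in predictive performance as dataset size shrinks. These results are remarkable given that the models are tested out of the box, without hyperparameter tuning, which keeps them easy to use and reduces computational cost.

English abstract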

Predictive models play a pivotal role in credit risk management, guiding critical decisions through accurate estimation of default probabilities and losses. Extensive research has introduced new modeling techniques, complemented by large-scale benchmarking studies consolidating the state-of-the-art. Today, quasi-standards such as gradient-boosting models paired with SHAP explainers have emerged, yet continuous improvement of risk models remains a top priority. Concurrently, rapid advancements in AI, most notably large language models, have disrupted predictive modeling paradigms. Foundation models, pretrained on extensive datasets from diverse domains, have demonstrated remarkable performance by leveraging prior knowledge. While prevalent in natural language processing and computer vision, foundation models for tabular data have only recently emerged. We conjecture that pretraining on out-of-domain data is particularly beneficial in small-data settings, such as SME lending or specialized corporate portfolios, and may help address longstanding challenges including low default portfolios and class imbalance. This paper benchmarks recently proposed tabular foundation models against a broad set of competitors, including established and advanced machine learning techniques, across two core tasks: PD and LGD modeling. Our evaluation encompasses various datasets, performance indicators, and experimental conditions. We find that tabular foundation models generally perform best across datasets and tasks. Moreover, they offer significant improvement in predictive performance as dataset size shrinks. These results are remarkable given that the models are tested out-of-the-box, without hyperparameter tuning, ensuring ease of use and mitigating computational costs.

2605.18144 2026-05-19 cs.AI

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

Christiaan G. A. Viviers, Koen de Bruin, Mirre M. Trines, Ayla M. Hokke, Roy van der Meel, Avi Schroeder, Twan Lammers, Willem J. M. Mulder, Fons van der Sommen

Affiliations * ARIA Lab, Signal Processing Systems, Department of Electrical Engineering, Eindhoven University of Technology; Laboratory of Chemical Biology, Department of Biomedical Engineering, Eindhoven University of Technology; The Louis Family Laboratory for Targeted Drug Delivery and Personalized Medicine Technologies, Department of Chemical Engineering, Technion - Israel Institute of Technology; Department of Nanomedicine and Theranostics, Institute for Experimental Molecular Imaging (ExMI), RWTH Aachen University Hospital; Department of Internal Medicine and Radboud Center for Infectious Diseases (RCI), Radboud University Medical Center

AI summary This study presents pArticleMap, a system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow to support research direction selection and hypothesis generation in nanomedicine. By generating and scoring citation-grounded hypotheses, it provides evidence-grounded research assistance.

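Details
AI abstract (translated from Chinese)

Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused mainly on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow. The system does not forecast future concept co-occurrence; it targets low-density article-level bridge regions and cluster interfaces, then uses LLMs in an agentic setup to generate and score citation-grounded hypotheses. We evaluate the system with a retrospective realization benchmark (generating later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, the pooled gold recovery rate was 10.8%, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.

English abstract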

Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused primarily on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow for grounded ideation. Rather than forecasting future concept co-occurrence, pArticleMap targets low-density article-level bridge regions and cluster interfaces, then generates and scores citation-grounded hypotheses with large language models in an agentic setup. We evaluate the system with a retrospective realization benchmark (generate later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, a pooled gold recovery rate of 10.8% was obtained, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.

2605.18143 2026-05-19 cs.AI

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

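Generative AI and the Productivity Divide: Human-AI Complementarities in Education

Lihi Idan, Bharat Anand

Affiliations * Leonard N. Stern School of Business, New York University; Industrial and Systems Engineering Department, Texas A&M University

AI summary This study examines how the productivity effects of generative AI differ across users and finds that AI Interaction Competence (AIC) is the key factor determining how much users benefit. A conceptual-map intervention reduces inequality, and the authors stress pairing AI access with AIC micro-training and standard procedures to capture value consistently.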

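Details
AI abstract (translated from Chinese)

Generative artificial intelligence (GenAI) is changing how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects. We report results from a randomized controlled experiment in which participants (analogs of early-career knowledge workers) were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the gains were distributed very unevenly. Improvements were predicted not by GPA or prior knowledge but by AI Interaction Competence (AIC), the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity but introduces a new axis of capability inequality. For managers, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.

English abstract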

Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants (analogs of early-career knowledge workers) were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by AI Interaction Competence (AIC), the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.

2605.18132 2026-05-19 cs.CV cs.AI

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

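Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

Sihan Ma, Siyuan Liang, Dacheng Tao

Affiliations * College of Computing & Data Science, Nanyang Technological University, Singapore

AI summary This study proposes a method for determining which generative model created a given 3D asset. By building the first passive source attribution benchmark, it finds that generative 3D models leave stable fingerprints, establishing a new benchmark for trustworthy 3D content provenance.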

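Details
AI abstract (translated from Chinese)

Generative 3D models are used in gaming, robotics, and immersive creation, which makes source attribution critical: given a 3D asset, can we determine whether a generative model created it, and which one? The problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are spread across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To study the problem systematically, we build what is, to the best of our knowledge, the first passive source attribution benchmark, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two kinds of stable fingerprints: cross-view inconsistency, and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments show strong performance: 97.22% accuracy under full supervision and 77.17% accuracy with only 1% of the training data, which corresponds to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

English abstract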

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

2605.18130 2026-05-19 cs.CV

Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis

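Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis

Fengyi Zhang, Xujie Zeng, Mohan Liu, Zengyi Wang, Yalong Jiang

Affiliations * Student Member, IEEE; Member, IEEE

AI summary This paper proposes the Rad-VLSM framework, which uses a semantics-guided prompting mechanism to improve the accuracy of medical image segmentation and diagnosis, addressing the tendency of existing models to be distracted by background tissue and irrelevant visual correlations.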

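Details
AI abstract (translated from Chinese)

Medical image segmentation is more clinically valuable when it supports diagnosis and does not merely produce lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, and existing models may be distracted by background tissue, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks then serve as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization and not for direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. On a private clinical breast ultrasound dataset and public benchmarks, Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

English abstract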

Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

2605.18128 2026-05-19 cs.AI

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

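POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

Suofei Zhang, Yaxuan Zheng, Haifeng Hu

Affiliations * School of Internet of Things; National Engineering Research Center of Communications and Networking

AI summary This paper proposes a new framework that unifies spatio-temporal modeling through joint prior-observation adversarial learning to address spatial over-generalization in multivariate time series anomaly detection. It reports a new state of the art in both time-wise detection and spatial localization on public datasets and a purpose-built benchmark.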

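Details
AI abstract (translated from Chinese)

Existing multivariate time series anomaly detection (MTSAD) frameworks increasingly rely on combining graph neural networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention has been paid to the spatial over-generalization problem, in which unconstrained structural modeling indiscriminately reconstructs anomalies and inevitably lowers detection recall. To address this problem, we propose a new framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as a structural prior and, during training, models the association discrepancy between the prior and the data-driven observation in a minimax manner. This adversarial optimization not only improves the model's sensitivity for time-wise detection but also enables it to localize anomalies to specific channels. To evaluate this localization capability systematically, we further build a synthetic benchmark with precise channel-wise annotations. Extensive experiments on public datasets and our dedicated benchmark show that the proposed framework establishes a new state of the art in both time-wise detection and spatial localization. Our code, pre-trained models, and benchmark are publicly available at https://github.com/anocodetest1/POST.

English abstract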

Existing Multivariate Time Series Anomaly Detection (MTSAD) frameworks increasingly rely on integrating Graph Neural Networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention is paid to the spatial over-generalization problem, where unconstrained structural modeling indiscriminately reconstructs anomalies, inevitably degrading detection recall. To tackle this problem, we propose a novel framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as structural prior and models the association discrepancy between prior and data-driven observation in a minimax manner during training. Such adversarial optimization not only improves the model sensitivity for time-wise detection, but also enables the model to localize anomalies to specific channels. To systematically evaluate this anomaly localization capability, we further construct a synthetic benchmark equipped with precise channel-wise annotations. Extensive experiments across public datasets and our dedicated benchmark demonstrate that the proposed framework establishes a new state-of-the-art in both time-wise detection and spatial localization tasks. Our code, pre-trained models, and benchmark are publicly available at https://github.com/anocodetest1/POST.

2605.18115 2026-05-19 cs.CV

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

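WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

Yiwei Guo, Shaobin Zhuang, Zhipeng Huang, Canmiao Fu, Chen Li, Jing Lyu, Yali Wang

Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; WeChat Vision, Tencent Inc.; Shanghai Jiao Tong University

AI summary This paper proposes WinTok, a hybrid tokenizer that achieves win-win performance by decoupling visual understanding from generation. It introduces transferable semantic tokens to reduce cross-task interference and improves reconstruction, understanding, and generation on multiple benchmarks.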

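Details
AI abstract (translated from Chinese)

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between the two tasks, because a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without the computational overhead of dual tokenizers. To further strengthen understanding, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, so they inherit strong discriminative power while remaining flexible. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source samples, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

English abstract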

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

2605.18111 2026-05-19 cs.CL cs.CV

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

Rafid Ahmed, Intesar Tahmid, Mir Sazzat Hossain, Tasnimul Hossain Tomal, Md Fahim, Md Farhad Alam Bhuiyan

Affiliations * Penta Global Limited; Center for Computational & Data Sciences, Independent University

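AI summary This paper introduces the BanglaMedVQA dataset for evaluating current foundation models on Bangla medical visual question answering. It finds that their performance is substantially lower than on English benchmarks, highlighting the challenges that low-resource languages pose for medical reasoning.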

Comments 14 pages, 7 figures, 5 tables, Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:1-14, 2026

Details
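AI-translated abstract (from Chinese)

Recent advances in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to show promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, although Bangla is one of the most widely spoken languages in the world, no MedVQA benchmark exists for it. To address this gap, we introduce the BanglaMedVQA dataset, comprising clinically validated image-question-answer pairs, together with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that current models perform poorly on English MedVQA benchmarks, our analysis shows that performance in Bangla is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models such as Gemma-3 occasionally outperform these models in general categories, they too perform poorly on clinically complex questions, underscoring the urgent need for high-quality evaluation methods.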

English abstract

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.

2605.18109 2026-05-19 cs.AI cs.CV cs.RO

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

Affiliations * Tsinghua University; Microsoft Research Asia; Peking University; Fudan University; Zhejiang University

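AI summary This paper proposes the TaskGround framework, which improves full-scene household reasoning through structured task inference. Its core contribution is the introduction of the FullHome evaluation suite, which validates the importance of executable task-structure inference in household scenes and demonstrates the effectiveness of compact local models in practical household deployment.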

Comments Project page: https://aaronfengzy.github.io/TaskGround/

Details
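AI-translated abstract (from Chinese)

In real home deployments, household agents must often start from a complete household scene and a situated household request rather than from a clean task specification. Such requests require the agent to identify task-relevant entities, recover the intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, the agent must infer an executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain a large amount of task-irrelevant information, which makes direct complete-scene prompting inefficient and error-prone. In practical deployment, the challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds the complete scene into compact task-relevant scene slices, infers an executable task structure, and compiles it into a grounded skill-level action sequence. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks covering diverse home-scale environments as well as goal-oriented and process-constrained requirements. On FullHome, TaskGround substantially improves task success rates on both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a key bottleneck in full-scene household reasoning and show that structured grounding can significantly improve the effectiveness of compact local models in practical household deployment.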

English abstract

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

2605.18105 2026-05-19 cs.CL

How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World

Brielen Madureira, Andreas Niekler, Marc Keuschnigg, Mariana Madruga de Brito

Affiliations * LeipzigLab – Climate Discourse; Leipzig University; Helmholtz Centre for Environmental Research; Linköping University

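AI summary This paper analyzes nearly 60k news articles about 5.5k landslide events over 25 years to examine how German newspapers report landslides around the world. It reveals overreporting of Southern and Western Europe and provides a reference for research on inequalities in media attention to international disasters.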

Comments Work in progress

Details
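AI-translated abstract (from Chinese)

Landslides often hit newsstands because of their destructive and potentially fatal effects. News is an important source of information for creating or enriching disaster databases and for expediting media-based studies of attention dynamics. To that end, news datasets must be filtered, geolocated, and validated. This paper focuses on how landslides around the world are reported in German newspapers. We analyze nearly 60k news articles about 5.5k news events over a 25-year period, compare them with external measures of countries' susceptibility to landslides, and provide insights, such as the overreporting of Southern and Western Europe, to encourage further research on inequalities in media attention to international disasters.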

English abstract

Landslides often hit newsstands due to their destructive and potentially fatal effects. News is a valuable source of information for creating or enriching disaster databases and for expediting media-based studies of the dynamics of media attention. To accomplish that, news datasets must be filtered, geolocated and validated. This paper focuses on how landslides around the world are reported in German newspapers. We analyse almost 60k news articles about 5.5k news events in a 25-year period, compare them with external measures of countries' susceptibility to landslides and provide insights, e.g., the overreporting of Southern and Western Europe, to encourage further studies on inequalities in media attention to international disasters.