arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19378 2026-05-20 cs.CV

Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Haiying Sha

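Affiliations: Haiying Sha

AI summary: This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) in video Diffusion Transformers. By analyzing routing-decision time series over more than 65 million tokens, it proposes the Functional Redundancy Hypothesis and outlines a three-step evolutionary roadmap from visual unification to a world model.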

Abstract

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.
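Failure mode (5) can be reproduced without special hardware. The sketch below emulates bfloat16 rounding in pure Python: near 1.0 the spacing between adjacent bfloat16 values is 2^-7 (about 0.0078), so an update of 1e-3 is rounded away on every step. The helper `to_bfloat16`, the update size, and the higher-precision master copy used as a workaround are illustrative assumptions; the abstract does not say which fix the paper adopts.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Hypothetical helper for illustration; NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32
    # (1 sign bit, 8 exponent bits, 7 mantissa bits).
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


update = 1e-3  # assumed size of a "tiny" per-step weight update
steps = 1000

# Weight stored only in bfloat16: every update is rounded away.
w_bf16 = 1.0
for _ in range(steps):
    w_bf16 = to_bfloat16(w_bf16 + update)

# Higher-precision master copy, cast to bfloat16 only when it is used.
w_master = 1.0
for _ in range(steps):
    w_master += update
w_cast = to_bfloat16(w_master)

print(w_bf16, w_cast)
```

The bfloat16-only weight never leaves 1.0, while the master copy accumulates all 1000 updates and casts to 2.0. Keeping a higher-precision master copy is a common mitigation in mixed-precision training, shown here only as one plausible remedy.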

2605.19377 2026-05-20 cs.LG cs.AI

The Evaluation Game: Beyond Static LLM Benchmarking

Paul Wang, Jade Garcia-Bourrée, Anne-Marie Kermarrec, Vincent Corruble

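Affiliations: Sorbonne Université, CNRS, LIP6; École Polytechnique Fédérale de Lausanne

AI summary: This paper proposes a game-theoretic framework for evaluating the safety of large language models. It represents data augmentation as a group action to analyze the interaction between evaluator and trainer, and it exposes the difference between local generalization and memorized patches in adversarial testing.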

Comments 36 pages

Abstract

As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator's group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.
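A toy version of the circle setting, offered only as a sketch: prompts are points on the discrete circle Z_n, the evaluator's benchmark is the orbit of a seed prompt under cyclic translation, and the trainer's fix covers every point within a generalization radius of a fine-tuned prompt. The function names and all numeric values are hypothetical and are not taken from the paper.

```python
def orbit(seed: int, step: int, n: int) -> list[int]:
    """Orbit of `seed` under repeated translation by `step` on Z_n."""
    points = []
    x = seed % n
    while x not in points:
        points.append(x)
        x = (x + step) % n
    return points


def circular_distance(a: int, b: int, n: int) -> int:
    """Shortest distance between two points on the discrete circle Z_n."""
    d = abs(a - b) % n
    return min(d, n - d)


def unpatched(benchmark, finetuned, radius, n):
    """Benchmark points farther than `radius` from every fine-tuned prompt."""
    return [
        x for x in benchmark
        if all(circular_distance(x, f, n) > radius for f in finetuned)
    ]


n = 12
benchmark = orbit(seed=0, step=3, n=n)  # the evaluator's orbit
finetuned = [0]                         # the trainer patched the seed prompt only

narrow = unpatched(benchmark, finetuned, radius=1, n=n)
wide = unpatched(benchmark, finetuned, radius=3, n=n)
print(benchmark, narrow, wide)
```

With a small radius the patch covers only the seed prompt and the rest of the orbit stays exploitable; a larger radius closes more of it. A static benchmark containing only the seed prompt would report both trainers as fixed.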

2605.19374 2026-05-20 cs.CV cs.AI cs.LG

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin

Affiliations: The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China; Research Institute for Smart Ageing, the Hong Kong Polytechnic University, Hong Kong, China; School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University, Beijing, China; Queen Mary Hospital, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China

AI summary: This paper proposes CoNNS, a concept-guided noisy-negative suppression framework. It builds a hierarchical concept ontology to address the noisy negatives that arise when different patients share similar findings, improving performance on zero-shot understanding tasks.

Comments Early accepted by MICCAI 2026

Abstract

Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
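The abstract does not give the form of the Concept-Aware NCE loss. As a rough sketch of the general idea of suppressing noisy negatives, the snippet below computes a standard InfoNCE loss for one image and then recomputes it with a flagged negative dropped from the denominator. The function name, similarity values, and temperature are assumptions made for illustration.

```python
import math


def info_nce(sims, pos, temperature=0.1, suppressed=frozenset()):
    """InfoNCE loss for one anchor image.

    sims: similarity of the anchor to each report in the batch.
    pos: index of the anchor's own report.
    suppressed: indices of negatives to drop from the denominator.
    """
    logits = [s / temperature for s in sims]
    kept = [i for i in range(len(sims)) if i == pos or i not in suppressed]
    peak = max(logits[i] for i in kept)  # subtract the max for numerical stability
    log_denominator = peak + math.log(
        sum(math.exp(logits[i] - peak) for i in kept)
    )
    return log_denominator - logits[pos]


# Report 1 comes from another patient but describes the same finding,
# so treating it as a negative would penalize a correct match.
sims = [0.9, 0.85, 0.1, 0.0]
plain = info_nce(sims, pos=0)
masked = info_nce(sims, pos=0, suppressed={1})
print(plain, masked)
```

Dropping the noisy negative lowers the loss for this anchor, so the encoder is no longer pushed away from a report that is semantically consistent with the image.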

2605.19371 2026-05-20 cs.CV cs.AI

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang

Affiliations: Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University; School of Computer Science and Technology, Xinjiang University; International Academic Center of Complex Systems, Beijing Normal University; School of Systems Science, Beijing Normal University

AI summary: This paper proposes Heat Dissipation Flow Matching (HDFM), which introduces a continuous blur (heat-dissipation) process to inject multi-scale priors. It addresses the confinement of blur-based models to SDE frameworks and brings better preservation of multi-scale detail and color budgets to ODE frameworks such as Flow Matching.

Abstract

Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, which supplies multi-scale priors and better preserves color budgets and multi-scale detail. However, blur-based models remain confined to SDE-based frameworks and have not been integrated into ODE-based frameworks such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process is ill-posed. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blur (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts $x$-prediction to mitigate the high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and $x$-prediction. HDFM outperforms most baseline methods on all datasets.

2605.19366 2026-05-20 cs.LG

Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems

Jimeng Shi

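Affiliations: College of Engineering and Computing

AI summary: This paper proposes three deep learning approaches for complex environmental science problems: WaLeF for flood prediction in coastal river systems, CoDiCast for global weather prediction, and Hypercube-RAG for scientific question answering in environmental science. Together they aim to improve the accuracy, efficiency, and explainability of environmental intelligence.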

Comments 161 pages

Abstract

Environmental science plays a pivotal role in safeguarding ecosystems, a domain driven by large-scale, heterogeneous data. In the big data era, artificial intelligence (AI) has emerged as a transformative tool for learning patterns and supporting decision-making. This dissertation develops AI-based approaches tailored to complex environmental science problems to achieve Environmental Intelligence, studying three specific challenges. First, we focus on flood prediction and management in coastal river systems. Conventional physics-based models are computationally intensive, limiting real-time application. To overcome this, we propose a deep learning (DL)-based model, WaLeF, for water level forecasting, and a forecast-informed DL model, FIDLAr, to manage water levels. Evaluated in a flood-prone coastal system in South Florida characterized by extreme rainfall and sea level fluctuations, FIDLAr outperforms baselines in accuracy and efficiency while providing interpretable outputs. Second, we target global weather prediction, which is challenged by massive data scale. Traditional physics-based methods are deterministic and computationally heavy. We propose CoDiCast, a conditional diffusion model tailored for probabilistic weather forecasting. CoDiCast adapts generative AI to predictive tasks, and experiments show that it achieves accurate, efficient forecasts with explicit uncertainty quantification. Lastly, we address scientific question-answering in environmental science. When answering in-domain questions, large language models (LLMs) often suffer from hallucinations due to out-of-date or limited knowledge. While retrieval-augmented generation (RAG) retrieves domain-specific knowledge, existing methods trade off accuracy, efficiency, or explainability. We propose Hypercube-RAG, built on a structured text cube framework, which exhibits all three properties simultaneously.

2605.19360 2026-05-20 cs.CV cs.LG cs.NE physics.app-ph physics.optics

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA; Bioengineering Department, University of California, Los Angeles, CA, 90095, USA; California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA

AI summary: This paper proposes a hybrid deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end. A programmable spatial light modulator enables massively parallel analog inference, raising the throughput and accuracy of video authenticity prediction while lowering computational cost.

Comments 30 Pages, 8 Figures

Abstract

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.

2605.19359 2026-05-20 cs.CV cs.LG

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Halil Ibrahim Gulluk, Olivier Gevaert

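Affiliations: Department of Electrical Engineering; Biomedical Informatics Research (BMIR); Stanford University

AI summary: This paper proposes MAM-CLIP, which uses a pretrained PubMedBERT and contrastive learning to improve BI-RADS classification of mammography images. Experiments show that the method substantially improves the F1 score when labeled samples are scarce.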

Abstract

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP

2605.19358 2026-05-20 cs.CL

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu, Jiaen Liang, Ying Fu, Wei Huang, Jitao Sang

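Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence; Beijing Jiaotong University; Unisound AI Technology Co., Ltd.

AI summary: This paper proposes the Conditional Entropy Shaping (CES) framework, which dynamically controls token-level response entropy so that large language models give concise solutions to simple problems and explore more deeply on hard ones, balancing response length against accuracy.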

Abstract

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
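The conditional bidirectional policy can be pictured with a small sketch. The snippet computes per-token entropy, treats tokens above a threshold as "forking points", and assigns them a negative term on a correct path and a positive term on an incorrect one. The threshold, the coefficient `beta`, the constant-magnitude term, and the function names are all assumptions; the abstract does not specify how CES folds this signal into the DAPO objective.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_shaping_terms(token_dists, is_correct, threshold=1.0, beta=0.1):
    """Per-token shaping term for one sampled reasoning path.

    High-entropy tokens are penalized when the final answer is correct
    and rewarded when it is incorrect; other tokens get zero.
    """
    sign = -1.0 if is_correct else 1.0
    return [
        sign * beta if token_entropy(dist) > threshold else 0.0
        for dist in token_dists
    ]


# Two tokens: a uniform (forking) distribution and a confident one.
dists = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
on_correct_path = entropy_shaping_terms(dists, is_correct=True)
on_incorrect_path = entropy_shaping_terms(dists, is_correct=False)
print(on_correct_path, on_incorrect_path)
```

Only the first token exceeds the threshold, so it alone receives a term, and the sign of that term flips with the correctness of the path.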

2605.19357 2026-05-20 cs.CL

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University; Zhongguancun Academy; IDEA; Xidian University; Peking University; University of Washington; University of Wisconsin–Madison; University of Illinois Chicago

AI summary: This paper proposes the SciCustom framework, which builds custom benchmarks from large-scale scientific data to evaluate LLM capabilities on specific scientific tasks. It requires neither expert annotation nor synthetic question generation, and it reveals fine-grained differences in scientific capability.

Comments Accepted to ACL 2026 Main Conference

Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.

2605.19346 2026-05-20 cs.CL cs.AI cs.LG

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

Joy Bose

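Affiliations: Independent Researcher

AI summary: This paper presents IMLJD, a dataset for analyzing Indian matrimonial dispute cases. It contains 3,613 court judgments covering cases under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482, and it uses structured labels, metadata-derived indicators, and a knowledge graph to reveal the difference in quashing-petition success rates between the Supreme Court and the Karnataka High Court.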

Comments 8 pages, 2 figures, 5 tables. Dataset available at huggingface.co/datasets/joyboseroy/imljd and Code at github.com/joyboseroy/imljd

Abstract

We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.
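The reported differentials follow directly from the quoted rates. The check below uses only figures from the abstract; the variable names are illustrative, and `Decimal` is used so the subtraction is exact.

```python
from decimal import Decimal

sc_all_years = Decimal("57.6")  # Supreme Court quash rate, 2000 to 2024
sc_matched = Decimal("59.3")    # Supreme Court quash rate, 2018 to 2024
karnataka = Decimal("39.7")     # Karnataka High Court quash rate, 2018 to 2024

gap_unmatched = sc_all_years - karnataka  # periods differ
gap_matched = sc_matched - karnataka      # same 2018 to 2024 window

print(gap_unmatched, gap_matched)
```

Restricting the Supreme Court sample to the matched window moves the gap from 17.9 to 19.6 percentage points, which is the widening the abstract describes.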

2605.19343 2026-05-20 cs.LG

What Makes a Representation Good for Single-Cell Perturbation Prediction?

Wenkang Jiang, Yuhang Liu, Yichao Cai, Erdun Gao, Jiayi Dong, Ehsan Abbasnejad, Lina Yao, Javen Qinfeng Shi

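Affiliations: Australian Institute for Machine Learning; Responsible AI Research Centre; College of Computer Science and Artificial Intelligence; Department of Data Science and AI; School of Computer Science and Engineering

AI summary: This paper proposes the PerturbedVAE framework, which separates perturbation-specific information from the dominant invariant structure and recovers causal representations so that this information can be used effectively for prediction. An identifiability analysis clarifies how the framework can be concretely specified under the stated conditions.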

Comments Accepted to ICML 2026

Abstract

Single-cell perturbation modeling is fundamental for understanding and predicting cellular responses to genetic perturbations. However, existing approaches, from causal representation learning to foundation models, often struggle with an overlooked challenge: gene expression is dominated by perturbation-invariant information, while perturbation-specific signals are intrinsically sparse. As a result, learned representations either entangle invariant and perturbation-specific information, leading to spurious and non-generalizable predictors, or suppress perturbation-specific signals altogether, rendering them ineffective for prediction. To address this, we propose PerturbedVAE, a general framework designed to resolve this signal imbalance. The framework explicitly separates perturbation-specific information from dominant invariant structure and recovers causal representations to effectively utilize such information for prediction. We further provide an identifiability analysis that characterizes the conditions under which sparse perturbation effects can be reliably recovered, thereby clarifying how the framework can be concretely specified under such conditions. Empirically, PerturbedVAE achieves state-of-the-art performance on a widely used benchmark across multiple evaluation settings, yielding significant gains on out-of-distribution combinatorial predictions and uncovering interpretable perturbation-response programs.

2605.19341 2026-05-20 cs.CL cs.AI cs.LG stat.ML

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

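Affiliations: Carnegie Mellon University; Patronus AI; Independent Researcher; Stanford University; The Ohio State University; DegenAI Labs

AI summary: This paper introduces the HalluWorld benchmark, which studies hallucination in language models through explicit reference world models. It finds that hallucination behavior is uneven across tasks, suggesting that hallucinations arise from several distinct failure modes rather than a single capability.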

Comments HalluWorld benchmark (code and data) at github.com/DegenAI-Labs/HalluWorld

Abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

2605.19340 2026-05-20 cs.CV

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

选择性、正则化和校准:利用视觉基础模型进行跨域少样本语义分割

Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao

发表机构 * Nanjing University(南京大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出HERA框架,通过选择性、正则化和校准的方法,有效利用视觉基础模型进行跨域少样本语义分割,提升了模型在新领域中的适应能力,并在多个基准上取得了更高的mIoU成绩。

Comments 20 pages, 11 figures, 13 tables. Accepted to CVPR 2026

详情
AI中文摘要

视觉基础模型(VFMs)在各种视觉任务中已取得优异性能。然而,将VFMs应用于跨域少样本分割(CD-FSS)仍然具有挑战性,因为CD-FSS需要在仅少量标记示例的情况下对新类别的对象进行分割,并且在域转移下进行。挑战主要由两个因素驱动:(1)每个新类别的标记示例有限,相对于VFM预训练的规模,这使模型在重新训练时容易过拟合;(2)目标域在预训练期间未被充分代表,导致跨域不一致性和层间敏感性。为了解决这些问题,我们提出了层次示例表示适应(HERA),一种基于VFMs的三阶段选择-正则化-校准分割框架,能够有效利用有限的标签并在不重新训练源数据的情况下适应新领域。我们首先设计了层次层选择(HLS)以自适应地识别最信息丰富的VFM层,使用数据依赖的示例转移风险(ETR)计算每个候选层。然后,先验引导正则化(PGR)对选定的表示进行正则化,产生后续阶段的结构化局部信号。此外,像素级自适应校准(PAC)将选定的表示与细化的交互图结合,校准像素级预测,产生一致的掩码。这些阶段共同形成一个层次选择-正则化-校准的管道,指导冻结的VFM特征在新领域中工作,同时在测试时仅微调不到2.7%的参数。广泛的实验表明,HERA在多个CD-FSS基准上超越了现有最佳方法,mIoU提高了超过4.1个百分点。

英文摘要

Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it remains challenging to apply VFMs to cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

2605.19337 2026-05-20 cs.AI

Agentic Trading: When LLM Agents Meet Financial Markets

代理交易:当大语言模型代理与金融市场相遇

Yihan Xia, Panpan You, Taotao Wang, Fang Liu, Han Qi, Xiaoxiao Wu, Shengli Zhang

发表机构 * College of Electronic and Information Engineering, Shenzhen University, Shenzhen, China(深圳大学电子与信息工程学院) Shenzhen Audencia Financial Technology Institute, Shenzhen University, Guangdong, People's Republic of China(深圳大学审计金融科技研究所)

AI总结 本文探讨了如何将大语言模型(LLM)作为交易系统中的代理,感知市场信息、检索上下文、进行决策推理、发出可交易动作并适应市场反馈。研究通过分析77项研究,发现协议不可比性是主要问题,提出证据日志、可重复性审计和报告检查表作为主要贡献。

Comments 59 pages, 15 figures, 27 tables

详情
AI中文摘要

越来越多的研究探讨如何将大语言模型(LLMs)作为智能体嵌入交易系统,使其能够感知市场信息、检索上下文、对决策进行推理、输出可交易动作并在市场反馈下进行适应。本文将基于LLM的交易智能体重新界定为专家系统式的决策流水线,并给出一份面向审计的证据图谱,涵盖77项纳入研究,取自一份按协议编码的快照,筛选截至2026年3月9日。主要实证子集(n=19)满足“动作输出加闭环评估”的最低边界条件;其余58项纳入研究作为背景和设计语境保留。核心实证发现是协议不可比性:在主要子集中,只有2/19项研究报告了可提取的时间一致的数据划分协议,1/19项报告了明确的交易成本模型,1/19项记录了标的池或幸存者偏差的处理方式,11/19项报告了执行时点或执行语义,15/19项被编码为R0,没有任何研究达到R3级可复现性。因此,我们将“架构-能力-适应”用作工作性的分析视角,而非经过验证的分类体系,并将证据台账、可复现性审计和报告检查表作为主要贡献。这项综述表明,架构层面的实验正在迅速扩展,而可比的评估协议、执行语义和可复现的成果产物仍是该领域当前的瓶颈。

英文摘要

A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.

2605.19330 2026-05-20 cs.AI cs.LG cs.SE

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

MOCHA:多目标切比雪夫退火用于智能体技能优化

Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton, Subhojyoti Mukherjee, Anlan Zhang, Sungchul Kim, Somdeb Sarkhel, Sunav Choudhury

发表机构 * Adobe Research(Adobe研究院)

AI总结 该研究提出MOCHA方法,通过切比雪夫标量化和指数退火解决智能体技能优化中的多目标问题,实现更优的帕累托前沿发现和性能提升。

Comments Preprint. 25 pages, 14 figures, 5 tables

详情
AI中文摘要

LLM智能体通过技能来组织行为。技能是结构化的自然语言规范,用于约束智能体如何推理、检索和响应。与单体提示不同,技能是多字段构件,受平台硬性约束:描述字段会因路由而被截断,指令正文通过渐进式披露被压缩,同时驻留的多个技能还要竞争有限的上下文窗口。这些约束使技能优化本质上成为多目标问题:一个技能必须同时最大化任务性能并满足平台限制。然而,现有提示优化器要么忽略这些权衡,要么将其折叠成加权和,从而错过非凸目标区域中的帕累托最优变体。我们提出MOCHA(多目标切比雪夫退火),用切比雪夫标量化替代单目标选择,可覆盖包括非凸区域在内的完整帕累托前沿,并结合指数退火实现从探索到利用的过渡。在六个多样化智能体技能上的实验中,所有方法共享相同的多目标变异算子,基线也获得相同的逐目标文本反馈;现有优化器在6个任务中的4个上无法改进种子技能,1000次rollout毫无进展。MOCHA在每个任务上都取得突破,平均正确率相对最强基线提升7.5%(在FEVER上最高达14.9%,在TheoremQA上达10.4%),同时发现的帕累托最优技能变体数量是基线的两倍。

英文摘要

LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many Pareto-optimal skill variants.
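The contrast between a weighted sum and Chebyshev scalarization can be illustrated with a small sketch. The code below is not MOCHA itself: the three candidates, the ideal point, the equal weights, and the softmax-with-temperature selection rule are all hypothetical choices made for illustration, and the way the annealed temperature feeds into selection is an assumption rather than something the abstract specifies.

```python
import math

def chebyshev_score(objectives, weights, ideal):
    """Weighted Chebyshev distance to the ideal point (lower is better).
    Objectives are assumed to be maximized, so the gap is ideal - value."""
    return max(w * (z - f) for f, w, z in zip(objectives, weights, ideal))

def weighted_sum(objectives, weights):
    """Plain weighted sum of the objectives (higher is better)."""
    return sum(w * f for f, w in zip(objectives, weights))

def annealed_temperature(step, t0=1.0, decay=0.01):
    """Exponentially decaying temperature (assumed schedule, hypothetical constants)."""
    return t0 * math.exp(-decay * step)

def selection_probs(scores, temperature):
    """Softmax over negated scores: a low temperature concentrates on the best score."""
    logits = [-s / temperature for s in scores]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates scored on two objectives, both maximized.
# C is Pareto-optimal but sits in a non-convex part of the front.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.45, 0.45)}
ideal = (1.0, 1.0)
weights = (0.5, 0.5)

cheb = {name: chebyshev_score(f, weights, ideal) for name, f in candidates.items()}
best_cheb = min(cheb, key=cheb.get)

# Sweep weighted-sum weights and record which candidates ever win.
ws_winners = set()
for i in range(101):
    w = (i / 100, 1 - i / 100)
    ws_winners.add(max(candidates, key=lambda n: weighted_sum(candidates[n], w)))

print("Chebyshev pick:", best_cheb)
print("Weighted-sum winners over all weights:", sorted(ws_winners))
```

In this toy setup the balanced candidate C is never the weighted-sum winner for any weight vector in the sweep, while the Chebyshev score with equal weights prefers it, which is the kind of non-convex-region variant the abstract says weighted sums miss.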

2605.19325 2026-05-20 cs.LG

An Exterior Method for Nonnegative Matrix Factorization

非负矩阵分解的外方法

Qiujing Lu, Tonmoy Monsoor, Ehsan Ebrahimzadeh, Kartik Sharma, Vwani Roychowdhury

发表机构 * ECE, UCLA(加州大学洛杉矶分校电子与计算机工程系) eBay Search Science Team(eBay搜索科学团队)

AI总结 本文提出了一种非负矩阵分解的外方法(eNMF),通过分离低秩近似和非负性约束,解决了传统内部方法在非凸优化中收敛慢或陷入次优解的问题,并在多个数据集上验证了其优越的性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

非负矩阵分解(NMF)旨在寻找因子非负的低秩近似$X \approx UV^T$,通常采用在整个优化过程中强制保持可行性的内部方法求解。我们证明,这种约束驱动的方法可能在非凸景观中阻碍进展,导致收敛缓慢或收敛到次优驻点。我们提出了一种NMF的外部框架(eNMF),将低秩近似与非负性约束的施加分离开来。我们的方法从最优无约束分解初始化,并引入一种旋转过程,将无约束因子映射到距非负象限最近的外部点。这一视角产生了一种算法框架,其中简单的迭代更新收敛到位于正象限边界上、满足KKT条件的驻点。外部形式还为NMF解提供了几何解释,澄清了在置换和正交变换下分解的等价类。一项涉及400个NMF实验、涵盖真实和合成数据集的数值结果显示,在99%的情况下,不同算法倾向于收敛到等价的因子矩阵。我们在3个真实世界数据集和2个合成数据集上,将eNMF与9种最先进的NMF算法及9种初始化方案的组合进行基准比较。eNMF始终优于全部81个竞争对手,在相等时间设置下重建误差最多降低30%,在相等误差设置下最多获得150%的加速。下游实验进一步表明其在音频处理和推荐任务中带来显著的性能提升,证实了所提出外部优化框架的实用价值。代码可在https://github.com/roychowdhuryresearch/eNMF获取。

英文摘要

Nonnegative matrix factorization (NMF) seeks a low-rank approximation $X \approx UV^T$ with nonnegative factors and is commonly solved using interior methods that enforce feasibility throughout optimization. We show that such constraint-driven approaches can impede progress in the nonconvex landscape, leading to slow convergence or convergence to suboptimal stationary points. We propose an exterior framework for NMF (eNMF) that separates low-rank approximation from nonnegativity enforcement. Our method initializes from the optimal unconstrained factorization and introduces a rotation procedure that maps unconstrained factors to an exterior point closest to the nonnegative orthant. This viewpoint yields an algorithmic framework in which simple iterative updates converge to KKT-satisfying stationary points on the boundary of the positive orthant. The exterior formulation also enables a geometric interpretation of NMF solutions, clarifying equivalence classes of factorizations under permutation and orthogonal transformations. An intriguing numerical result, involving 400 NMF experiments across both real and synthetic datasets, shows that in 99% of the cases, different algorithms tend to converge towards equivalent factor matrices. We benchmark eNMF against 9 state-of-the-art NMF algorithms with 9 initialization schemes across 3 real-world and 2 synthetic datasets. eNMF consistently outperforms all 81 competitors, achieving up to 30% lower reconstruction error under equal-time settings and up to 150% speedup under equal-error settings. The downstream experiments further demonstrate substantial performance gains in audio processing and recommendation tasks, corroborating the practical benefits of the proposed exterior optimization framework. Code is available at https://github.com/roychowdhuryresearch/eNMF.

2605.19324 2026-05-20 cs.LG

BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics

BrainDyn: 一种用于生成脑动态的sheaf神经ODE

Siddharth Viswanath, Panayiotis Ketonis, Chen Liu, Michael Perlmutter, Dhananjay Bhaskar, Smita Krishnaswamy

发表机构 * Yale University(耶鲁大学) Boise State University(博伊西州立大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出BrainDyn,一种基于sheaf神经ODE的模型,用于生成脑动态,通过LSTM编码脑区活动历史,利用sheaf拉普拉斯算子促进信息传递,实现跨模态的强预测能力。

详情
AI中文摘要

高效的神经网络模型能够生成类似大脑动态的活动,可以用于生成合成数据、分析在测试扰动活动等条件下大脑瞬态的差异以及推断底层生成动态。然而,大型语言模型(LLMs)或标准循环神经网络(RNNs)忽略了解剖组织,因此不产生与脑区对齐的组件。另一方面,基于图的网络通常有非常简单的消息传递规则,这些规则不足以表达类似大脑的动态。为此,我们引入了BrainDyn,一种用于在结构化脑图上连续时间动态的sheaf神经ODE模型。BrainDyn使用长短期记忆(LSTM)模型在滑动时间窗口上编码每个脑区的最近活动历史,以生成隐藏状态或茎,这些状态通过可学习的限制映射投影到边特定的共享空间中。这些共享空间中相邻节点之间的差异由sheaf拉普拉斯算子表征,可以促进神经元单元之间的信息传递。这些信息的输出然后被馈送到神经ODE中,该神经ODE控制神经元活动的连续时间演变。我们对静息态fMRI(PNC数据集)、头皮EEG与局灶性癫痫(TUSZ数据集)以及由NEST尖峰网络模拟器模拟的活动进行了评估。BrainDyn在跨模态中实现了强大的预测能力,所得到的表示支持下游任务,包括在硅中扰动预测。

英文摘要

Efficient neural network models that generate brain-like dynamic activity can be a valuable resource for generating synthetic data, analyzing differences in brain transients under conditions such as testing perturbation activity or inferring the underlying generative dynamics. However, large language models (LLMs) or standard recurrent neural networks (RNNs) ignore the anatomical organization and therefore do not produce components that align with brain regions. On the other hand, graph-based networks often have very simple message passing rules that are not sufficiently expressive for brain-like dynamics. To address this, we introduce BrainDyn, a sheaf neural ordinary differential equation (neural ODE) model for continuous-time dynamics on structured brain graphs. BrainDyn encodes the recent activity history of each brain region using a long short-term memory (LSTM) model over a sliding temporal window to produce hidden states, or stalks, that are projected through learnable restriction maps into edge-specific shared spaces. Discrepancies between neighboring nodes in these shared spaces are characterized by a sheaf Laplacian that can facilitate message passing between neuronal units. The output of these messages is then fed to a neural ODE that governs the continuous-time evolution of neuronal activity. We evaluated BrainDyn on resting-state fMRI (PNC dataset), scalp EEG with focal epilepsy (TUSZ dataset), and simulated activity from the NEST spiking network simulator. BrainDyn achieves strong forecasting ability across modalities, and the resulting representations support downstream tasks including in silico perturbation prediction.
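The role of restriction maps and the sheaf Laplacian described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the restriction maps here are fixed hand-written matrices rather than learned ones, the stalk vectors stand in for LSTM hidden states, and the function name `sheaf_laplacian_apply` and the data layout are hypothetical, not taken from BrainDyn.

```python
def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def matvec_t(m, y):
    """Multiply the transpose of matrix m by vector y."""
    return [sum(m[i][j] * y[i] for i in range(len(m))) for j in range(len(m[0]))]

def sheaf_laplacian_apply(stalks, edges, restrictions):
    """Apply the sheaf Laplacian L = delta^T delta to node features.

    stalks:       {node: feature vector}
    edges:        list of (u, v) pairs
    restrictions: {(node, edge_index): matrix mapping the node stalk into the edge space}
    """
    out = {n: [0.0] * len(x) for n, x in stalks.items()}
    for k, (u, v) in enumerate(edges):
        fu, fv = restrictions[(u, k)], restrictions[(v, k)]
        pu, pv = matvec(fu, stalks[u]), matvec(fv, stalks[v])
        diff = [a - b for a, b in zip(pu, pv)]  # disagreement in the shared edge space
        for n, f, sign in ((u, fu, 1.0), (v, fv, -1.0)):
            back = matvec_t(f, diff)  # pull the disagreement back to the node stalk
            out[n] = [o + sign * b for o, b in zip(out[n], back)]
    return out

# With identity restriction maps the sheaf Laplacian reduces to the graph Laplacian.
identity = [[1.0, 0.0], [0.0, 1.0]]
stalks = {0: [1.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 2.0]}
edges = [(0, 1), (1, 2)]
restrictions = {(0, 0): identity, (1, 0): identity, (1, 1): identity, (2, 1): identity}
result = sheaf_laplacian_apply(stalks, edges, restrictions)

# Non-trivial maps: the two nodes agree once projected into the edge space,
# so the Laplacian output is zero even though the raw features differ.
stalks2 = {0: [1.0, 5.0], 1: [2.0, -3.0]}
restrictions2 = {(0, 0): [[2.0, 0.0]], (1, 0): [[1.0, 0.0]]}
result2 = sheaf_laplacian_apply(stalks2, [(0, 1)], restrictions2)
```

The second example shows why restriction maps make the operator more expressive than plain graph message passing: agreement is measured after projection into an edge-specific space, not on the raw node features.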

2605.19322 2026-05-20 cs.CV

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

DynaTok: 时序自适应和位置偏见感知的视频大语言模型token压缩

Minyoung Park, Taehun Kong, Sangjun Ahn

发表机构 * LG Electronics, Seoul, South Korea(LG电子,首尔,韩国)

AI总结 本文提出DynaTok,一种无需训练的时序自适应和位置偏见感知的token压缩框架,通过在时序和空间维度上分配token预算,有效减少冗余的时空覆盖,提升视频大语言模型的效率和鲁棒性。

详情
AI中文摘要

近年来,视频大语言模型(Video-LLMs)的进步显著扩展了多模态推理能力。然而,从长视频序列中提取的大量视觉token带来了高昂的计算成本,限制了其在现实场景中的应用。现有的无训练token压缩方法基于注意力大小作为语义重要性的代理进行token选择,但往往忽视位置偏见并仅依赖短期时间局部性,导致冗余的时空覆盖和低效的token使用。我们提出了DynaTok,一种无需训练、时序自适应且偏见感知的token压缩框架,能够在时序和空间维度上分配token预算。通过轻量级的指数移动平均(EMA)内存,时序预算分配(TBA)模块动态地将较少的token分配给冗余帧,将更多的token分配给新颖的帧,捕捉长期时间变化。空间预算分配(SBA)模块通过基于激活的注意力图选择空间多样性和语义重要的特征,同时利用空间内存减少已选区域的冗余并缓解位置偏见。DynaTok无缝集成到现有的Video-LLMs中,如LLaVA-OneVision和LLaVA-Video,无需重新训练,并在高强度压缩下有效保留语义覆盖。在四个代表性VideoQA基准测试-MVBench、LongVideoBench、MLVU和VideoMME上的实验表明,即使在90%的token减少下,DynaTok仍能保留超过95%的基线准确性,优于最近的无训练方法。这些结果表明,DynaTok为高效和稳健的视频推理提供了系统的基础,为未来Video-LLMs实现实时流媒体视频理解铺平了道路。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
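One way to picture the Temporal Budget Allocation idea is the toy sketch below. It is not the DynaTok implementation: the novelty measure (one minus cosine similarity to an EMA of past frame features), the `alpha` value, the per-frame minimum, and the largest-remainder rounding are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def novelty_scores(frames, alpha=0.9):
    """Score each frame by how far it is from an EMA memory of earlier frames."""
    memory = list(frames[0])
    scores = [1.0]  # the first frame is treated as fully novel
    for frame in frames[1:]:
        scores.append(max(0.0, 1.0 - cosine(frame, memory)))
        # EMA update: the memory drifts slowly toward the newest frame
        memory = [alpha * m + (1 - alpha) * f for m, f in zip(memory, frame)]
    return scores

def allocate_budget(scores, total_tokens, min_tokens=1):
    """Split total_tokens across frames in proportion to novelty."""
    n = len(scores)
    spare = total_tokens - min_tokens * n
    if spare < 0:
        raise ValueError("total_tokens is too small for the per-frame minimum")
    total = sum(scores)
    shares = [spare * s / total if total > 0 else spare / n for s in scores]
    budget = [min_tokens + int(s) for s in shares]
    # Hand out the tokens lost to rounding, largest fractional part first.
    leftover = total_tokens - sum(budget)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget

# Three near-identical frames followed by a scene change (hypothetical features).
frames = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
scores = novelty_scores(frames)
budget = allocate_budget(scores, total_tokens=20)
print(budget)
```

With a slowly updating memory, the frames right after the scene change keep a high novelty score for a while, so the redundant frames fall back to the minimum budget and the freed tokens go to the new content.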

2605.19319 2026-05-20 cs.CV

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

SWEET:基于图像编辑的稀疏世界建模用于具身任务执行

Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室) Central South University(中南大学)

AI总结 本文研究图像编辑模型能否作为稀疏视觉世界模型用于机器人操作,通过预测任务级未来状态而非密集视频生成,提出SWEET框架实现稀疏视觉规划,结合语言指令和空间引导生成关键帧,并通过扩散动作预测器生成可执行动作,实验表明其在不同场景中提升关键帧预测能力。

详情
AI中文摘要

视觉预测已成为具身控制的有前景范式,其中未来观察被生成并转化为动作。然而,密集视频生成计算成本高且对许多操作任务而言往往不必要,其进展可以总结为少量任务相关视觉状态。本文研究图像编辑模型能否作为稀疏视觉世界模型用于机器人操作,通过预测任务级未来状态而非密集视频生成。我们首先在相同的机器人数据设置下比较视频生成模型Wan2.2和图像编辑模型FLUX-Kontext,发现图像编辑能生成更可靠的任务级关键帧,具有更好的视觉保真度和显著更低的推理成本。受此启发,我们提出SWEET,一种单次稀疏视觉规划框架,通过连续图像编辑生成一系列任务相关操作关键帧,基于语言指令和可选箭头式空间引导。一个目标条件化的扩散动作预测器将相邻想象的关键帧转换为可执行的动作块。为了减少真实与编辑视觉子目标之间的不匹配,我们进一步引入混合训练策略,使用过滤后的编辑目标。在DROID和RoboMimic上的实验表明,SWEET在已见和未见场景中均提升了关键帧预测能力,并实现了从序列关键帧规划到可执行机器人动作的完整流程,表明图像编辑是具身视觉预测中一个有前景但尚未被广泛探索的方向。

英文摘要

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

2605.19317 2026-05-20 cs.LG cs.AI

Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement

通过迭代部分细化在扩散模型中实现推理时间扩展

Taegu Kang, Jaesik Yoon, Sungjin Ahn

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种无需外部验证器的扩散模型推理时间扩展方法Iterative Partial Refinement,通过在混合噪声条件下迭代部分细化生成更一致的样本,在MNIST Sudoku任务中提升了有效解率。

Comments Accepted at the ICLR 2026 Workshop on AI with Recursive Self-Improvement

详情
AI中文摘要

推理时间扩展已成为提升推理能力的主要方法,并越来越多地应用于扩散模型。然而,现有的扩散模型推理时间扩展方法通常依赖外部验证器或奖励模型来排名和选择样本,限制了其在这些评估器可用且可靠的情况下可扩展性。此外,尽管最近的扩散模型进行区域-wise、混合噪声推理,但针对此设置的推理时间扩展仍相对未被探索。我们提出Iterative Partial Refinement (IPR),一种针对顺序扩散模型的推理时间扩展方法,无需外部验证器。从已生成的样本开始,IPR重新噪声一部分区域并根据剩余区域重新生成它们,使模型能够在比初始生成时更丰富的上下文中修订早期决策。这种迭代部分细化生成更一致的样本而无需外部验证。在需要全局约束满足的推理任务中,IPR一致地提升了性能:在MNIST Sudoku任务中,有效解率从55.8%提高到75.0%。这些结果表明,仅迭代部分细化即可作为扩散模型在顺序、混合噪声设置中的有效推理时间扩展策略。代码可在:https://github.com/ahn-ml/IPR获取。

英文摘要

Inference-time scaling has emerged as a major approach for improving reasoning capabilities, and has been increasingly applied to diffusion models. However, existing inference-time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, limiting their scalability to settings where such evaluators are available and reliable. Moreover, while recent diffusion models perform sequential inference with region-wise, mixed-noise conditioning, inference-time scaling tailored to this setting remains relatively underexplored. We propose Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion that requires no external verifier. Starting from an already-generated sample, IPR re-noises a subset of regions and regenerates them conditioned on the remaining regions, enabling the model to revise earlier decisions under a richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks requiring global constraint satisfaction, IPR consistently improves performance: on MNIST Sudoku, the valid solution rate increases from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference-time scaling strategy for diffusion models in sequential, mixed-noise settings. Code is available at: https://github.com/ahn-ml/IPR

2605.19316 2026-05-20 cs.CL

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

一种用于特征约束难度控制的多智能体框架(在阅读理解项目生成中)

Seonjeong Hwang, Jun Seo, Hyounghun Kim, Gary Geunbae Lee

发表机构 * Graduate School of Artificial Intelligence, POSTECH, Republic of Korea(韩国POSTECH人工智能研究生院) Department of Computer Science and Engineering, POSTECH, Republic of Korea(韩国POSTECH计算机科学与工程系)

AI总结 本文提出MAFIG多智能体框架,通过多个LLM代理和特征特定评估器协作生成并迭代修订项目,以满足特征约束,从而实现更稳定的难度控制。

Comments ACL 2026 Main Conference

详情
AI中文摘要

最近的研究在难度控制的阅读理解项目生成中利用大型语言模型(LLMs)通过调整与难度相关的特征来生成项目。然而,现有方法通常依赖于单代理提示方法,这往往无法一致地满足指定的特征约束,导致生成的项目偏离目标难度水平。为了解决这一限制,我们引入了MAFIG,一种用于特征约束项目生成的多代理框架,其中多个LLM代理和特征特定评估器协作生成并根据预期约束迭代修订项目。此外,为了验证MAFIG在难度控制中的有效性,我们提出了一种构造特征约束集序列的方法,该序列产生难度单调递增的项目。实验结果表明,MAFIG生成符合目标约束的项目率显著高于基线方法,通过难度校准的约束序列实现了稳健的难度控制。

英文摘要

Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

2605.19314 2026-05-20 cs.RO cs.AI

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

ContextFlow:长周期具身智能体的分层任务-状态对齐

Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang, Quanming Yao

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Qiuzhen College, Tsinghua University(清华大学求真书院) Beijing Institute of Mathematical Sciences and Applications(北京雁栖湖应用数学研究院) Department of Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)数据科学与分析学域) Institute of Microelectronics, Chinese Academy of Sciences(中国科学院微电子研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文研究了长周期具身智能体中任务-状态不一致问题,提出ContextFlow框架通过显式合同表示阶段、运行时观测转为证据包以及应用作用域更新来实现任务前沿对齐,提高任务执行的连贯性和可审计性。

详情
AI中文摘要

长周期具身智能体越来越多地将导航、搜索、接近和操作任务委托给专门执行器。随着这些执行器变得更强,瓶颈从局部技能执行转移到在规划、监控、记忆和执行之间保持一致的任务前沿。我们研究了任务-状态不一致,即在任务层面一致性失败,其中规划器的活跃阶段、运行时证据、记忆上下文和委托执行器不再支持相同的下一步决策。这种失败可能导致不支持的手动交接、阶段锁定、执行器-上下文不匹配和不必要的重新规划。我们提出ContextFlow,一个可检查的对齐框架,将阶段表示为显式合同,将运行时观测转换为证据包,并应用包括继续、细化、转移、提升和修复在内的作用域更新。ContextFlow使专门执行器负责局部闭环控制,同时使任务前沿对齐显式且可审计。在长周期具身任务上的实验和演示轨迹展示了证据基础的作用域更新如何诊断和缓解反复出现的任务-状态失败。

英文摘要

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

2605.19311 2026-05-20 cs.LG eess.SP

An Objective Performance Evaluation of the LSTM Networks in Time Series Classification

LSTM网络在时间序列分类中的客观性能评估

Sooraj Sunil, Balakumar Balasingam

发表机构 * Electrical and Computer Engineering(电气与计算机工程系) University of Windsor(温莎大学)

AI总结 本文提出了一种评估框架,比较了LSTM分类器与基于模型的期望最大化(EM)分类器在二元时间序列分类中的性能,发现当数据符合假设模型类时,EM分类器表现优异,而LSTM分类器需要更大的噪声统计分离度才能实现可靠的分类,且在模型仅在测量噪声上不同的情况下,其性能低于参考分类器。

Comments Accepted in 2026 29th International Conference on Information Fusion

详情
AI中文摘要

深度学习的快速采用已导致数据驱动模型取代经典基于模型的算法,即使在由良好理解的物理定律支配的领域也是如此。尽管数据驱动模型,如长短期记忆(LSTM)网络,已成为时间序列分析的流行选择,但其在结构化环境中的性能相对于基于模型的方法很少被客观评估。本文提出了一种性能评估框架,比较了LSTM分类器与基于模型的期望最大化(EM)分类器在二元时间序列分类中的性能。评估是在两个仅在噪声统计上不同的标量线性高斯状态空间模型上进行的,其中卡尔曼滤波似然比率检验使用真实参数作为最佳可实现分类性能的参考。通过蒙特卡洛模拟,分类器在三个轴上进行评估:任务难度,由过程或测量噪声之间分离度控制;序列长度;以及训练数据集大小。结果表明,当数据符合假设模型类时,利用已知模型结构的EM分类器表现良好。LSTM分类器需要更大的噪声统计分离度才能实现可靠的分类,并且在模型仅在测量噪声上不同的情况下,其性能低于参考分类器,无论序列长度或训练数据集大小如何。

英文摘要

The rapid adoption of deep learning has increasingly led to data-driven models replacing classical model-based algorithms, even in domains governed by well-understood physical laws. While data-driven models, such as long short-term memory (LSTM) networks, have become a popular choice for time-series analysis, their performance relative to model-based approaches in structured environments is rarely evaluated objectively. This paper presents a performance evaluation framework comparing an LSTM classifier against a model-based expectation maximization (EM) classifier for binary time-series classification. The evaluation is conducted on two scalar linear Gaussian state space models differing only in their noise statistics, where the Kalman filter likelihood ratio test with true parameters serves as a reference for the best achievable classification performance. Through Monte Carlo simulations, the classifiers are evaluated across three axes: task difficulty, controlled by the separation in process or measurement noise between the two models; sequence length; and training dataset size. The results show that the EM classifier, which exploits the known model structure, performs strongly when the data conform to the assumed model class. The LSTM classifier requires a larger separation in noise statistics to achieve reliable classification, and its performance saturates below the reference classifier when the models differ only in measurement noise, regardless of sequence length or training dataset size.
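The reference classifier mentioned above, a Kalman filter likelihood ratio test with true parameters, can be sketched for a scalar model as follows. This is a minimal sketch under stated assumptions: the state equation x_k = a x_{k-1} + w_k with unit observation gain, the specific values of a, q, and r, the sequence length, and the trial count are all hypothetical choices, and only the process-noise variance differs between the two models here.

```python
import math
import random

def simulate(a, q, r, n, rng):
    """Draw n observations from x_k = a*x_{k-1} + w_k, y_k = x_k + v_k."""
    x = rng.gauss(0.0, math.sqrt(q / (1 - a * a)))  # start from the stationary prior
    ys = []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, math.sqrt(q))
        ys.append(x + rng.gauss(0.0, math.sqrt(r)))
    return ys

def kalman_loglik(ys, a, q, r):
    """Log-likelihood of ys under the model, accumulated from Kalman innovations."""
    mean, var = 0.0, q / (1 - a * a)
    ll = 0.0
    for y in ys:
        mean, var = a * mean, a * a * var + q  # predict
        s = var + r                            # innovation variance
        innov = y - mean
        ll += -0.5 * (math.log(2 * math.pi * s) + innov * innov / s)
        gain = var / s
        mean, var = mean + gain * innov, (1 - gain) * var  # update
    return ll

def classify(ys, model0, model1):
    """Likelihood ratio test: return 1 if model1 explains ys better, else 0."""
    return int(kalman_loglik(ys, *model1) > kalman_loglik(ys, *model0))

# Two models (a, q, r) that differ only in process-noise variance q.
model0 = (0.9, 0.1, 1.0)
model1 = (0.9, 5.0, 1.0)

rng = random.Random(0)
trials = 20
correct = 0
for label, model in ((0, model0), (1, model1)):
    for _ in range(trials):
        ys = simulate(*model, n=200, rng=rng)
        correct += classify(ys, model0, model1) == label
accuracy = correct / (2 * trials)
print("LRT accuracy:", accuracy)
```

Shrinking the gap between the two `q` values, or moving the difference into `r` instead, is the kind of task-difficulty sweep the abstract describes; a learned classifier would then be compared against this reference on the same simulated sequences.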

2605.19307 2026-05-20 cs.CV

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

MetaRA:面向基于多模态大语言模型的视觉问答系统的蜕变鲁棒性评估

Quanxing Xu, Yuhao Tian, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang, Chia-Wen Lin

发表机构 * School of Computer Science and Engineering, Macau University of Science and Technology, Macao SAR(澳门科技大学计算机科学与工程学院) Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology(武汉理工大学计算机科学与人工智能学院,湖北省交通物联网重点实验室)

AI总结 本文提出MetaRA,一种基于蜕变测试的框架,用于评估基于多模态大语言模型的视觉问答系统的鲁棒性,通过生成受控的图像-问题输入变体,揭示模型在语言扰动、视觉线索依赖和多模态推理中的弱点。

详情
AI中文摘要

视觉问答(VQA)作为代表性的多模态任务,是评估多模态大语言模型(MLLMs)推理能力的关键基准。然而,现有评估主要依赖静态数据集和基于准确率的指标,无法刻画鲁棒性、一致性和泛化能力。受蜕变测试(MT)启发,我们提出蜕变鲁棒性评估(MetaRA),一种利用蜕变关系(MRs)系统性探测基于MLLM的VQA系统漏洞的测试框架。MetaRA根据特定MR生成受控的图像-问题输入变体,并在多样化的条件下评估模型。将MetaRA应用于不同任务上的多个基于MLLM的VQA模型,揭示了细微的失败模式,包括对语言扰动的敏感性、对表面视觉线索的过度依赖以及更深层次的多模态推理弱点。实验结果表明,MetaRA提供的诊断见解比传统准确率指标更丰富,暴露了在标准基准下仍被隐藏的失败模式。总体而言,本文强调了在VQA中进行系统性鲁棒性评估的必要性,并将蜕变评估定位为一种可扩展、与模型无关的方法,以迈向可信的多模态AI。

英文摘要

Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.

2605.19306 2026-05-20 cs.LG math.OC

A Two-Phase Adaptive Balanced Penalty Method for Controllable Pareto Front Learning under Split Feasibility Conditions

一种用于在分割可行性条件下可控帕累托前沿学习的两阶段自适应平衡惩罚方法

Nguyen Viet Hoang, Dung D. Le, Tran Ngoc Thang

发表机构 * Faculty of Applied Mathematics and Informatics, Hanoi University of Science and Technology(应用数学与信息技术学院,河内科学技术大学) College of Engineering and Computer Science, VinUniversity(工程与计算机科学学院,Vin大学)

AI总结 本文提出了一种自适应平衡惩罚算法,用于在分割可行性条件下训练可控帕累托前沿学习的超网络,通过自适应指标驱动的可计算下界,将约束帕累托问题转化为双层标量分割问题,并证明了在标准凸性假设下的完全序列收敛性。

Comments 36 pages, 18 figures, 12 tables. Submitted to Neural Networks (Elsevier)

详情
AI中文摘要

我们解决在分割可行性条件下训练超网络用于可控帕累托前沿学习(CPFL)的开放问题,具有严格的理论保证。我们将约束帕累托问题重新表述为双层标量分割问题(BSSP),并提出自适应平衡惩罚(ABP)算法,其三个梯度组件——最优性、集可行性以及图像可行性——通过由可计算下界驱动的自适应指标进行混合。利用一种新的凸替代技术,我们证明在标准凸性和Robbins-Monro步长假设下实现了完全序列收敛性。然后将ABP惩罚结构转换为一种两阶段、以可行性优先的训练策略,用于超MLP和超Trans架构(ABP-HyperNet)。为了评估受约束的CPFL,我们引入了预期可行超体积(EFHV),该指标联合捕捉了解的质量和约束满足。在五个多目标基准上的实验验证了ABP求解器相对于真实值的性能,同时三个多任务学习数据集展示了ABP-HyperNet在提高可行性从36-49%到87-100%的情况下,相比无约束基线达到了2.3倍更高的EFHV。

英文摘要

We address the open problem of training hypernetworks for Controllable Pareto Front Learning (CPFL) under split feasibility conditions with rigorous theoretical guarantees. We reformulate the constrained Pareto problem as a Bi-Level Scalarized Split Problem (BSSP) and propose the Adaptive Balanced Penalty (ABP) algorithm, whose three gradient components -- optimality, set feasibility, and image feasibility -- are blended through an adaptive indicator driven by a computable lower bound. Using a novel convex surrogate technique, we prove full-sequence convergence under standard convexity and Robbins-Monro step-size assumptions. The ABP penalty structure is then translated into a two-phase, feasibility-first training strategy for Hyper-MLP and HyperTrans architectures (ABP-HyperNet). To evaluate constrained CPFL, we introduce the Expected Feasible Hypervolume (EFHV), which jointly captures solution quality and constraint satisfaction. Experiments on five multi-objective benchmarks validate the ABP solver against ground truth, while three multi-task learning datasets demonstrate that ABP-HyperNet achieves up to 2.3x higher EFHV than unconstrained baselines by raising feasibility from 36-49% to 87-100%.

2605.19304 2026-05-20 cs.CV cs.GR

MMGS: 10× Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

MMGS: 通过多视图排序基于最优传输的10倍压缩3DGS

Beizhen Zhao, Sicheng Yu, Ziran Yin, Dongxu Shen, Hao Wang

Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

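AI summary This paper proposes MMGS, which combines multi-view contribution ranking with Optimal Transport-based aggregation and treats Gaussian optimization as a global geometric distribution matching problem. It reports 10× compression of 3DGS and 10× faster training while maintaining high rendering quality.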

Comments 19 pages

Details
AI Chinese abstract

尽管3D高斯散射(3DGS)已革新了3D重建,但其因大量冗余原始体而存在显著开销。现有压缩方法通常依赖局部采样或固定修剪阈值,难以在减少冗余与高保真渲染之间取得平衡。为此,我们提出了一种新的框架,将高斯优化建模为全局几何分布匹配问题。具体而言,我们的方法集成了三个组成部分:(1)我们引入了多视图3D高斯贡献排序机制,通过几何一致性过滤原始体,而不是使用局部启发式方法;(2)我们提出了基于全局最优传输(OT)的聚合算法,合并冗余原始体的同时保持底层几何;(3)我们设计了基于OT的致密化操作符,保持高斯的分布属性以实现稳定的优化。我们的方法仅使用10%的原始体和10倍于vanilla 3DGS的加速训练速度,实现了最先进的渲染质量。

English abstract

While 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it suffers from significant overhead due to massive redundant primitives. Existing compression methods typically rely on local sampling or fixed pruning thresholds, which often struggle to balance redundancy reduction with high-fidelity rendering. To address this, we propose a novel framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our approach integrates three components: (1) we introduce a multi-view 3D Gaussian contribution ranking mechanism that filters primitives using geometric consistency instead of local heuristics; (2) we propose a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) we design an OT-based densification operator that maintains the Gaussian's distributional properties for stable optimization. Our approach achieves state-of-the-art rendering quality with only 10% primitives and 10× accelerated training speeds compared to vanilla 3DGS.
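To give a feel for what merging Gaussians under an Optimal Transport criterion can look like, here is a minimal sketch in one dimension, where the 2-Wasserstein barycenter of Gaussians has a closed form. It is an illustrative simplification and not the MMGS algorithm: the paper works with 3D Gaussians and its own aggregation procedure, and the function names below are hypothetical.

```python
import math


def w2_distance_1d(mu_a, sigma_a, mu_b, sigma_b):
    """2-Wasserstein distance between two 1D Gaussians N(mu, sigma^2)."""
    return math.sqrt((mu_a - mu_b) ** 2 + (sigma_a - sigma_b) ** 2)


def w2_barycenter_1d(gaussians, weights):
    """Merge 1D Gaussians into their 2-Wasserstein barycenter.

    gaussians: list of (mu, sigma) pairs; weights: non-negative, same length.
    In 1D the barycenter is again Gaussian, with mean and standard deviation
    equal to the weighted averages of the input means and standard deviations.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mu = sum(w * m for w, (m, _) in zip(weights, gaussians)) / total
    sigma = sum(w * s for w, (_, s) in zip(weights, gaussians)) / total
    return mu, sigma


# Two equally weighted primitives collapse into one.
merged = w2_barycenter_1d([(0.0, 1.0), (2.0, 3.0)], [1.0, 1.0])
```

Note that the merged standard deviation is the average of the input standard deviations, not the spread of the mixture, so the barycenter keeps a Gaussian shape instead of widening to cover both inputs.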

2605.19301 2026-05-20 cs.CV

iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

iGSP:隐式梯度子空间投影用于高效视觉-语言模型的持续学习

Xuezhi Cui, Dongbo Zhou, Wang Guo, Zeyuan Wang, Ziyu Li, Gaozhi Zhou, Xian Li, Ling Zhao, Wentao Yang, Chao Tao, Haifeng Li

Affiliations * School of Geosciences and Info-Physics, Central South University; School of Earth Sciences and Spatial Information Engineering, Hunan University of Science and Technology

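AI summary This paper proposes iGSP, a framework that uses implicit gradient subspace projection for efficient continual learning of vision-language models. It addresses the parameter inefficiency and cross-task alignment inconsistency of earlier methods and significantly improves training efficiency and knowledge reuse.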

Details
AI Chinese abstract

视觉-语言模型需要高效适应不断出现的下游任务。尽管参数高效微调可以缓解灾难性遗忘,但为每个任务分配孤立模块会导致参数爆炸。相反,最近的相似性驱动共享机制错误地将表面视觉相似性等同于底层对齐一致性。这种根本性不匹配导致在视觉相似但逻辑不同的任务之间产生严重的负迁移,并未能利用在视觉上多样的任务之间的对齐重用。我们提出,对齐共享本质上是共享低秩子空间内重叠优化轨迹的几何问题。基于这一见解,我们提出iGSP,一种通过隐式梯度子空间投影实现高效适应的新框架。利用MoE路由器的早期收敛性来建立子空间基底,iGSP将适应过程分为两个阶段。首先,子空间识别阶段通过基底预扩展引入候选专家,应用一种新的子空间约束正则化来隐式地将新任务梯度投影到历史子空间,并通过将路由概率视为梯度流指示器来精确修剪冗余维度,最终最大化知识重用。其次,正交子空间微调阶段固定这一结构基底并去除正则化,快速拟合任务特定的残差损失。在MTIL基准测试中,iGSP在准确率上达到最先进的水平,同时显著提高了训练效率,与当前最先进的方法相比,平均可训练参数减少了42.7%,相对于其他方法最终总参数减少了86.9%。源代码可在https://github.com/GeoX-Lab/iGSP上获得。

English abstract

Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately maximizing knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.
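The geometric idea behind this abstract, splitting a new task's gradient into a part that lies in a previously learned subspace and a residual orthogonal to it, can be sketched as follows. This is an explicit projection written for illustration only: according to the abstract, iGSP performs the projection implicitly through a regularizer, and the helper names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def split_gradient(grad, basis):
    """Return (component inside span(basis), orthogonal residual)."""
    inside = [0.0] * len(grad)
    for b in basis:
        coeff = dot(grad, b)
        inside = [p + coeff * bi for p, bi in zip(inside, b)]
    residual = [g - p for g, p in zip(grad, inside)]
    return inside, residual


# Hypothetical "historical" subspace spanned by two earlier update directions.
basis = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
inside, residual = split_gradient([3.0, 4.0, 5.0], basis)
```

In this toy case the reusable part of the gradient is `[3, 4, 0]` and the task-specific residual is `[0, 0, 5]`, which loosely mirrors the split between the subspace identification phase and the orthogonal fine-tuning phase described above.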

2605.19299 2026-05-20 cs.LG

Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications

跨范式知识蒸馏:随机森林与深度神经网络之间双向知识转移的综合性研究用于大数据应用

Mahdi Naser Moghadasi

Affiliations * BrightMind AI Research

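AI summary This paper studies bidirectional knowledge distillation between Random Forests and Deep Neural Networks and proposes new distillation methods. Across 144 experiments it shows that bidirectional RF-DL distillation is competitive on classification and regression tasks while offering the complementary benefits of interpretability and expressiveness.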

Details
AI Chinese abstract

大数据的指数增长加剧了对能够处理多样化数据特征并保持计算效率的高效且可解释的机器学习模型的需求。知识蒸馏主要集中在神经网络到神经网络的转移,跨范式知识转移则鲜有探索。本文首次系统研究了随机森林(RF)与深度神经网络(DNN)之间的双向知识蒸馏,填补了集成学习和大数据应用中的模型压缩关键空白。我们提出了一种新的方法,包括渐进多阶段蒸馏、来自多样化树模型的多教师集成蒸馏以及不确定性感知的跨范式转移机制。通过在6个多样化的数据集上进行144次全面实验,涵盖了分类和回归任务,我们证明双向RF-DL蒸馏在保持可解释性的同时,提供了神经网络的表达能力。我们的结果表明,多教师集成蒸馏在传统方法上始终表现更优,其中NN-COMPACT在分类任务中达到98.13%的分类准确率,NN-WIDE在回归任务中达到92.6%的R²分数。所提出的框架使大数据环境中的部署更加灵活,可以根据计算约束和可解释性需求进行最优模型选择。这项工作在跨范式知识转移领域建立了新的研究方向,对可解释AI和资源受限大数据系统中的可扩展模型部署具有重要影响。

English abstract

The exponential growth of big data has intensified the need for efficient and interpretable machine learning models that can handle diverse data characteristics while maintaining computational efficiency. Knowledge distillation has primarily focused on neural network-to-neural network transfer, leaving cross-paradigm knowledge transfer largely unexplored. This paper presents the first comprehensive study of bidirectional knowledge distillation between Random Forests (RF) and Deep Neural Networks (DNN), addressing critical gaps in ensemble learning and model compression for big data applications. We propose novel methodologies including progressive multi-stage distillation, multi-teacher ensemble distillation from diverse tree models, and uncertainty-aware cross-paradigm transfer mechanisms. Through 144 comprehensive experiments across 6 diverse datasets encompassing classification and regression tasks, we demonstrate that bidirectional RF-DL distillation achieves competitive performance while providing complementary benefits: interpretability from tree models and expressiveness from neural networks. Our results show that multi-teacher ensemble distillation consistently outperforms traditional approaches, with NN-COMPACT achieving 98.13% classification accuracy and NN-WIDE reaching 92.6% R^2 score in regression tasks. The proposed framework enables deployment flexibility in big data environments, allowing optimal model selection based on computational constraints and interpretability requirements. This work establishes a new research direction in cross-paradigm knowledge transfer with significant implications for interpretable AI and scalable model deployment in resource-constrained big data systems.
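As a rough illustration of distilling a tree ensemble into a neural network, the sketch below computes the standard soft-target loss: the cross-entropy between a teacher's class probabilities (for a Random Forest, for example the fraction of trees voting for each class) and the student's softened predictions. It shows only this generic loss, not the progressive, multi-teacher, or uncertainty-aware schemes the paper proposes, and the numbers are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_probs, student_logits, temperature=1.0):
    """Cross-entropy between teacher probabilities and student probabilities."""
    student_probs = softmax(student_logits, temperature)
    return -sum(
        p * math.log(q) for p, q in zip(teacher_probs, student_probs) if p > 0
    )


# Hypothetical teacher output: fraction of trees voting for each of 3 classes.
teacher = [0.7, 0.2, 0.1]
agreeing = soft_target_loss(teacher, [2.0, 0.5, 0.0])
disagreeing = soft_target_loss(teacher, [0.0, 0.5, 2.0])
```

A student whose logits rank the classes the same way as the teacher (`agreeing`) gets a lower loss than one that reverses the ranking (`disagreeing`). The loss is minimized when the student reproduces the teacher's probabilities exactly.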

2605.19289 2026-05-20 cs.CV

What Makes Synthetic Data Effective in Image Segmentation

是什么使合成数据在图像分割中有效

Jinjin Zhang, Xiefan Guo, Yizhou Jin, Nan Zhou, Di Huang

Affiliations * State Key Laboratory of Complex and Critical Software Environment; Beihang University; School of Computer Science and Engineering

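AI summary This paper studies what makes synthetic data effective for image segmentation. Analyzing synthetic images from state-of-the-art diffusion models, it finds that dense scene composition and fine instance fidelity are the key factors, and it proposes SENSE, a unified framework for improving segmentation performance.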

Comments Accepted to ICML 2026

Details
AI Chinese abstract

受大规模生成模型快速发展的推动,合成数据已成为视觉理解的有前途的解决方案。尽管现代扩散模型在生成逼真图像方面表现出色,但其在复杂视觉分割任务中的潜力仍待探索。在本工作中,我们系统分析了最先进的扩散模型生成的合成图像,以揭示其有效性的决定因素。特别是,具有密集场景构成和精细实例保真度的合成图像表现出显著优势,能够产生更具判别性的空间表示。基于这些见解,我们提出了SENSE,一种利用灵活且可扩展的合成数据显著提升分割性能的统一框架。值得注意的是,SENSE是模型无关的,可与多种架构(如DPT和Mask2Former)兼容,并能有效扩展到参数容量不同的模型。在Cityscapes、COCO和ADE20K上的广泛实验验证了我们方法的有效性和泛化能力。代码可在https://github.com/zhang0jhon/SENSE获取。

English abstract

Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.

2605.19285 2026-05-20 cs.CL cs.AI cs.CY

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

推理依据是否必要且充分?面向可解释虚假信息检测的大语言模型微调

Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye

Affiliations * College of Computer Science and Technology, Jilin University; Tongyi Lab, Alibaba Group; School of Artificial Intelligence, Jilin University; College of Computer Science and Technology, Zhejiang University

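AI summary This paper studies how to fine-tune large language models (LLMs) for explainable misinformation detection. It proposes LONSREX, a new data synthesis pipeline that locates necessary and sufficient rationales, addressing the insufficient and redundant rationales that coarse-grained labels and over-verification behavior cause in existing pipelines.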

Comments Accepted by KDD 2026. 12 pages, 8 figures. Code: https://github.com/wangbing1416/LONSREX

Details
AI Chinese abstract

社交媒体上虚假信息的快速传播已成为一个严峻挑战。为缓解其扩散,虚假信息检测(MD)已成为关键研究领域。传统基于小模型的MD方法通常通过黑盒过程进行二元分类。近年来,大型语言模型(LLMs)的兴起使可解释性MD成为可能,其中模型生成推理依据以解释其决策,从而提高透明度。现有可解释性MD方法主要集中在构建复杂的提示以从现成的LLMs中引出推理依据。在本文中,我们提出了一种管道来微调专门用于可解释性MD的LLM。我们的管道首先收集大规模经过事实核查的文章,然后使用多个强大的LLMs生成真实性预测和推理依据。为了确保高质量的训练数据,我们利用一种过滤策略,仅选择正确的实例进行微调。虽然该管道直观且常见,但我们的实验表明,仅基于标签正确性的简单过滤在实践中是不够的,并存在两个关键限制:(1)粗粒度标签导致推理依据不充分:仅基于二元标签过滤的推理依据不足以充分支持其决策;(2)过度验证行为导致不必要的推理依据:更强的LLMs倾向于表现出过度验证行为,生成过度冗长和不必要的推理依据。为了解决这些问题,我们引入了LONSREX,一种新的数据合成管道,用于定位可解释性MD中必要且充分的推理依据。具体来说,我们提出了一种度量标准,量化每个验证步骤对最终预测的贡献,从而评估其必要性和充分性。实验结果展示了LONSREX的有效性。

English abstract

The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.