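arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.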

2605.19378 2026-05-20 cs.CV

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang

Affiliations: Department of Systems Science, Faculty of Arts and Sciences, Beijing Normal University; School of Computer Science and Technology, Xinjiang University; International Academic Center of Complex Systems, Beijing Normal University; School of Systems Science, Beijing Normal University

AI summary: This paper proposes Heat Dissipation Flow Matching (HDFM), which injects multi-scale priors by introducing a continuous blurring (heat-dissipation) process. It addresses the confinement of blur-based models to SDE frameworks and, within ODE frameworks such as Flow Matching, better preserves multi-scale detail and color budgets.

Abstract

Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, better preserving color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain confined to SDE-based frameworks and have not been integrated into ODE-based frameworks such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process is ill-posed. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blurring (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts $x$-prediction to mitigate the high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and $x$-prediction. HDFM outperforms most baseline methods on all datasets.
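
To make the "color budget" idea concrete, the following is a minimal sketch of heat dissipation acting as a blur. It assumes a 1D signal, periodic boundaries, and an explicit finite-difference step; the function names and the step size `alpha` are illustrative assumptions, not the paper's implementation.

```python
def heat_step(u, alpha=0.25):
    """One explicit finite-difference step of the 1D heat equation.

    Periodic boundaries are an assumption of this sketch; alpha <= 0.5
    keeps the explicit scheme stable.
    """
    n = len(u)
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


def dissipate(u, steps, alpha=0.25):
    """Apply `steps` heat steps; more steps give a coarser (blurrier) scale."""
    for _ in range(steps):
        u = heat_step(u, alpha)
    return u


signal = [0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 0.0, 0.0]
blurred = dissipate(signal, steps=20)

# The blur flattens detail (the value range shrinks) while the total
# intensity is conserved up to floating-point error.
print(sum(signal), sum(blurred))
print(max(signal) - min(signal), max(blurred) - min(blurred))
```

Each step replaces a value with a weighted average of itself and its neighbors, so fine detail fades first while the sum over the signal stays fixed. That conserved sum is one simple reading of what a blur-based corruption preserves and noise-based corruption does not.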

2605.19366 2026-05-20 cs.LG

Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems

Jimeng Shi

Affiliations: College of Engineering and Computing

AI summary: This dissertation proposes three deep learning methods for complex environmental science problems: the WaLeF model for flood prediction in coastal river systems, the CoDiCast model for global weather forecasting, and the Hypercube-RAG method for scientific question answering in environmental science, aiming to improve the accuracy, efficiency, and explainability of environmental intelligence.

Comments: 161 pages

Abstract

Environmental science plays a pivotal role in safeguarding ecosystems, a domain driven by large-scale, heterogeneous data. In the big data era, artificial intelligence (AI) has emerged as a transformative tool for learning patterns and supporting decision-making. This dissertation develops AI-based approaches tailored to complex environmental science problems to achieve Environmental Intelligence, studying three specific challenges. First, we focus on flood prediction and management in coastal river systems. Conventional physics-based models are computationally intensive, limiting real-time application. To overcome this, we propose a deep learning (DL)-based model, WaLeF, for water level forecasting, and a forecast-informed DL model, FIDLAr, to manage water levels. Evaluated in a flood-prone coastal system in South Florida characterized by extreme rainfall and sea level fluctuations, FIDLAr outperforms baselines in accuracy and efficiency while providing interpretable outputs. Second, we target global weather prediction, which is challenged by massive data scale. Traditional physics methods are deterministic and computationally heavy. We propose CoDiCast, a conditional diffusion model tailored for probabilistic weather forecasting. Adapted from generative AI for predictive tasks, experiments show CoDiCast achieves accurate, efficient forecasts with explicit uncertainty quantification. Lastly, we address scientific question-answering in environmental science. When answering in-domain questions, large language models (LLMs) often suffer from hallucinations due to out-of-date or limited knowledge. While retrieval-augmented generation (RAG) retrieves domain-specific knowledge, existing methods trade off accuracy, efficiency, or explainability. We propose Hypercube-RAG, built on a structured text cube framework, which successfully exhibits all three properties simultaneously.

2605.19360 2026-05-20 cs.CV cs.LG cs.NE physics.app-ph physics.optics

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA; Bioengineering Department, University of California, Los Angeles, CA, 90095, USA; California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA

AI summary: This paper proposes a hybrid deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end. A programmable spatial light modulator enables massively parallel analog inference, raising the throughput and accuracy of video authenticity prediction while lowering computational cost.

Comments: 30 pages, 8 figures

Abstract

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.

2605.19359 2026-05-20 cs.CV cs.LG

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Halil Ibrahim Gulluk, Olivier Gevaert

Affiliations: Department of Electrical Engineering; Biomedical Informatics Research (BMIR); Stanford University

AI summary: This paper proposes MAM-CLIP, which uses a pretrained PubMedBERT and contrastive learning to improve BI-RADS classification of mammography images. Experiments show that the method significantly improves the F1 score when labeled samples are scarce.

Abstract

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
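
The contrastive pretraining step described above can be illustrated with a symmetric InfoNCE loss of the kind used in CLIP-style training. This is a sketch under stated assumptions: the toy 2D embeddings, the helper names, and the temperature of 0.07 are illustrative choices, and the abstract does not specify the exact loss MAM-CLIP uses.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index `target`."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image should match the i-th caption."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and caption.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy embeddings standing in for vision-encoder and text-encoder outputs.
images = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
captions_aligned = [[0.9, 0.0], [0.1, 1.0], [-1.0, 0.0]]
captions_shuffled = [captions_aligned[1], captions_aligned[2], captions_aligned[0]]

loss_aligned = clip_loss(images, captions_aligned)
loss_shuffled = clip_loss(images, captions_shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is low when each image is closest to its own caption and high when the pairing is scrambled, which is the signal that pushes caption information into the vision encoder.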

2605.19358 2026-05-20 cs.CL

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu, Jiaen Liang, Ying Fu, Wei Huang, Jitao Sang

Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence; Beijing Jiaotong University; Unisound AI Technology Co., Ltd.

AI summary: This paper proposes the Conditional Entropy Shaping (CES) framework, which dynamically controls token-level response entropy so that large language models produce concise solutions on simple problems and explore more deeply on hard ones, balancing response length against accuracy.

Abstract

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
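
The conditional bidirectional idea can be sketched as a per-token shaping term whose sign depends on whether the response was correct. This is only an illustration: the threshold rule for "forking point" tokens, the names `threshold` and `coef`, and the linear form of the term are hypothetical, and the abstract does not give the actual CES objective or how it is combined with DAPO.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_shaping_terms(step_probs, is_correct, threshold=1.0, coef=0.1):
    """Per-token shaping term for one sampled response.

    High-entropy tokens get a negative term on a correct response and a
    positive term on an incorrect one; all other tokens get 0.
    `threshold` and `coef` are hypothetical hyperparameters.
    """
    sign = -1.0 if is_correct else 1.0
    terms = []
    for probs in step_probs:
        h = token_entropy(probs)
        terms.append(sign * coef * h if h > threshold else 0.0)
    return terms


# Three decoding steps over a toy 4-token vocabulary.
response = [
    [0.97, 0.01, 0.01, 0.01],  # confident token, low entropy
    [0.25, 0.25, 0.25, 0.25],  # forking point, entropy = ln 4
    [0.90, 0.05, 0.03, 0.02],  # confident token
]

terms_correct = entropy_shaping_terms(response, is_correct=True)
terms_incorrect = entropy_shaping_terms(response, is_correct=False)
print(terms_correct)
print(terms_incorrect)
```

Only the uniform middle step exceeds the threshold, so it is the only token penalized on a correct path and rewarded on an incorrect one. Confident tokens are left untouched in both cases.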

2605.19357 2026-05-20 cs.CL

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

Affiliations: State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University; Zhongguancun Academy; IDEA; Xidian University; Peking University; University of Washington; University of Wisconsin–Madison; University of Illinois Chicago

AI summary: This paper proposes the SciCustom framework, which builds custom benchmarks from large-scale scientific data to evaluate LLM capabilities on specific scientific tasks, without expert annotation or synthetic question generation, and reveals fine-grained differences in scientific capability.

Comments: Accepted to ACL 2026 Main Conference

BrainDyn: A Sheaf Neural ODE for Generating Brain Dynamics

Siddharth Viswanath, Panayiotis Ketonis, Chen Liu, Michael Perlmutter, Dhananjay Bhaskar, Smita Krishnaswamy

Affiliations: Yale University; Boise State University; University of Wisconsin–Madison

AI summary: This paper proposes BrainDyn, a sheaf neural ODE model for generating brain dynamics. It encodes each brain region's activity history with an LSTM and uses a sheaf Laplacian to facilitate message passing, achieving strong forecasting ability across modalities.

Abstract

Efficient neural network models that generate brain-like dynamic activity can be a valuable resource for generating synthetic data, analyzing differences in brain transients under conditions such as testing perturbation activity or inferring the underlying generative dynamics. However, large language models (LLMs) or standard recurrent neural networks (RNNs) ignore the anatomical organization and therefore do not produce components that align with brain regions. On the other hand, graph-based networks often have very simple message passing rules that are not sufficiently expressive for brain-like dynamics. To address this, we introduce BrainDyn, a sheaf neural ordinary differential equation (neural ODE) model for continuous-time dynamics on structured brain graphs. BrainDyn encodes the recent activity history of each brain region using a long short-term memory (LSTM) model over a sliding temporal window to produce hidden states, or stalks, that are projected through learnable restriction maps into edge-specific shared spaces. Discrepancies between neighboring nodes in these shared spaces are characterized by a sheaf Laplacian that can facilitate message passing between neuronal units. The output of these messages is then fed to a neural ODE that governs the continuous-time evolution of neuronal activity. We evaluated BrainDyn on resting-state fMRI (PNC dataset), scalp EEG with focal epilepsy (TUSZ dataset), and simulated activity from the NEST spiking network simulator. BrainDyn achieves strong forecasting ability across modalities, and the resulting representations support downstream tasks including in silico perturbation prediction.
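
The role of the restriction maps and the sheaf Laplacian can be sketched on a two-node graph. In this illustration the node states are fixed vectors and the restriction maps are fixed matrices; in the model described above they would be LSTM hidden states and learned maps. All names, the step size, and the plain diffusion dynamics are assumptions of the sketch, not the BrainDyn architecture.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matvec_t(m, v):
    """Multiply the transpose of m by v."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]


def edge_discrepancy(f_u, x_u, f_v, x_v):
    """Disagreement of two node states after mapping both into the edge space."""
    return [a - b for a, b in zip(matvec(f_u, x_u), matvec(f_v, x_v))]


def sheaf_energy(states, edges):
    """Sheaf Laplacian quadratic form: sum of squared edge discrepancies."""
    total = 0.0
    for u, v, f_u, f_v in edges:
        d = edge_discrepancy(f_u, states[u], f_v, states[v])
        total += sum(c * c for c in d)
    return total


def sheaf_diffusion_step(states, edges, step=0.25):
    """One explicit Euler step of dx/dt = -L x, with L the sheaf Laplacian."""
    new_states = {k: list(x) for k, x in states.items()}
    for u, v, f_u, f_v in edges:
        d = edge_discrepancy(f_u, states[u], f_v, states[v])
        for i, g in enumerate(matvec_t(f_u, d)):
            new_states[u][i] -= step * g
        for i, g in enumerate(matvec_t(f_v, d)):
            new_states[v][i] += step * g
    return new_states


# Two regions joined by one edge; identity restriction maps for simplicity.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
edges = [("A", "B", identity, identity)]

energy_before = sheaf_energy(states, edges)
energy_after = sheaf_energy(sheaf_diffusion_step(states, edges), edges)
print(energy_before, energy_after)
```

With identity maps this reduces to ordinary graph diffusion. Non-identity maps let each edge compare only the aspects of the two node states that it projects into its shared space, which is what makes the message passing more expressive than a plain graph Laplacian.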

2605.19322 2026-05-20 cs.CV

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

Minyoung Park, Taehun Kong, Sangjun Ahn

Affiliations: LG Electronics, Seoul, South Korea

AI summary: This paper proposes DynaTok, a training-free, temporally adaptive and positional bias-aware token compression framework that allocates token budgets across temporal and spatial dimensions, reducing redundant spatio-temporal coverage and improving the efficiency and robustness of Video-LLMs.

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
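
The temporal budget idea can be sketched as follows: keep an EMA memory of frame features, score each frame by how far it departs from that memory, and split the token budget in proportion to the score. The novelty measure (one minus cosine similarity), the `decay` and `floor` parameters, and the rounding rule are all hypothetical; the abstract does not give the TBA module's actual formulas.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def allocate_temporal_budget(frame_feats, total_tokens, decay=0.9, floor=1):
    """Split `total_tokens` across frames in proportion to their novelty.

    Novelty is 1 - cosine similarity between a frame feature and an EMA
    memory of earlier frames. Every frame keeps at least `floor` tokens.
    Assumes non-zero feature vectors.
    """
    memory = list(frame_feats[0])
    novelty = [1.0]  # assumption: the first frame counts as fully novel
    for feat in frame_feats[1:]:
        novelty.append(max(0.0, 1.0 - cosine(feat, memory)))
        memory = [decay * m + (1.0 - decay) * f for m, f in zip(memory, feat)]
    spare = total_tokens - floor * len(frame_feats)
    total_novelty = sum(novelty)
    budget = [floor + int(spare * n / total_novelty) for n in novelty]
    # Hand tokens lost to rounding down to the most novel frame.
    budget[novelty.index(max(novelty))] += total_tokens - sum(budget)
    return budget


# Frames 1 and 2 nearly repeat frame 0; frame 3 shows something new.
frames = [[1.0, 0.0], [1.0, 0.01], [1.0, 0.0], [0.0, 1.0]]
budget = allocate_temporal_budget(frames, total_tokens=64)
print(budget)
```

Near-duplicate frames fall back to the floor allocation, while the frame that departs from the running memory receives a large share. Because the memory is an average over all earlier frames rather than just the previous one, a scene that returns after a long gap can still be recognized as redundant.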

2605.19319 2026-05-20 cs.CV

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang, Quanming Yao

Affiliations: Department of Electronic Engineering, Tsinghua University; Qiuzhen College, Tsinghua University; Beijing Institute of Mathematical Sciences and Applications; Department of Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou); Institute of Microelectronics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: This paper studies task-state misalignment in long-horizon embodied agents and proposes the ContextFlow framework, which aligns the task frontier by representing stages as explicit contracts, converting runtime observations into evidence packets, and applying scoped updates, improving the coherence and auditability of task execution.

Abstract

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

2605.19311 2026-05-20 cs.LG eess.SP

An Objective Performance Evaluation of the LSTM Networks in Time Series Classification

Sooraj Sunil, Balakumar Balasingam

Affiliations: Electrical and Computer Engineering; University of Windsor

AI summary: This paper presents an evaluation framework comparing an LSTM classifier with a model-based expectation maximization (EM) classifier on binary time-series classification. The EM classifier performs strongly when the data conform to the assumed model class, whereas the LSTM classifier needs a larger separation in noise statistics to classify reliably and falls short of the reference classifier when the models differ only in measurement noise.

Comments: Accepted in 2026 29th International Conference on Information Fusion

Abstract

The rapid adoption of deep learning has increasingly led to data-driven models replacing classical model-based algorithms, even in domains governed by well-understood physical laws. While data-driven models, such as long short-term memory (LSTM) networks, have become a popular choice for time-series analysis, their performance relative to model-based approaches in structured environments is rarely evaluated objectively. This paper presents a performance evaluation framework comparing an LSTM classifier against a model-based expectation maximization (EM) classifier for binary time-series classification. The evaluation is conducted on two scalar linear Gaussian state space models differing only in their noise statistics, where the Kalman filter likelihood ratio test with true parameters serves as a reference for the best achievable classification performance. Through Monte Carlo simulations, the classifiers are evaluated across three axes: task difficulty, controlled by the separation in process or measurement noise between the two models; sequence length; and training dataset size. The results show that the EM classifier, which exploits the known model structure, performs strongly when the data conform to the assumed model class. The LSTM classifier requires a larger separation in noise statistics to achieve reliable classification, and its performance saturates below the reference classifier when the models differ only in measurement noise, regardless of sequence length or training dataset size.
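
The reference classifier mentioned above can be sketched with a scalar Kalman filter that accumulates the measurement log-likelihood under each model, followed by a likelihood ratio test. The model form, the parameter values, the equal-prior decision rule, and the sequence length are assumptions chosen for illustration; they are not the paper's experimental settings.

```python
import math
import random


def kalman_loglik(zs, a, q, r, x0=0.0, p0=1.0):
    """Log-likelihood of measurements under x_k = a*x_{k-1} + w, z_k = x_k + v,
    with w ~ N(0, q) and v ~ N(0, r), computed by a scalar Kalman filter."""
    x, p, loglik = x0, p0, 0.0
    for z in zs:
        x_pred = a * x                 # predict the state and its variance
        p_pred = a * a * p + q
        innov = z - x_pred             # innovation and its variance
        s = p_pred + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innov * innov / s)
        gain = p_pred / s              # measurement update
        x = x_pred + gain * innov
        p = (1.0 - gain) * p_pred
    return loglik


def simulate(n, a, q, r, rng):
    """Draw n measurements from the scalar linear Gaussian model."""
    x, zs = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, math.sqrt(q))
        zs.append(x + rng.gauss(0.0, math.sqrt(r)))
    return zs


def classify(zs, model_0, model_1):
    """Likelihood ratio test with equal priors: pick the likelier model."""
    return int(kalman_loglik(zs, **model_1) > kalman_loglik(zs, **model_0))


# Two hypothetical models that differ only in measurement noise variance r.
model_0 = dict(a=0.9, q=1.0, r=0.1)
model_1 = dict(a=0.9, q=1.0, r=4.0)

rng = random.Random(0)
label_0 = classify(simulate(300, rng=rng, **model_0), model_0, model_1)
label_1 = classify(simulate(300, rng=rng, **model_1), model_0, model_1)
print(label_0, label_1)
```

Because the filter is run with the true parameters, this test uses all the structure the models have. Shrinking the gap between the two `r` values or shortening the sequence makes the task harder, which corresponds to the difficulty and sequence-length axes the abstract describes.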

2605.19307 2026-05-20 cs.CV

iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

Xuezhi Cui, Dongbo Zhou, Wang Guo, Zeyuan Wang, Ziyu Li, Gaozhi Zhou, Xian Li, Ling Zhao, Wentao Yang, Chao Tao, Haifeng Li

Affiliations: School of Geosciences and Info-Physics, Central South University; School of Earth Sciences and Spatial Information Engineering, Hunan University of Science and Technology

AI summary: This paper proposes the iGSP framework, which achieves efficient continual learning for vision-language models through implicit gradient subspace projection. It addresses the shortcomings of earlier methods in parameter efficiency and cross-task alignment consistency, and significantly improves training efficiency and knowledge reuse.

Abstract

Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately maximizing knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.
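
For intuition about what projecting a gradient onto a historical subspace means, here is an explicit version of the operation. Note the swap: iGSP is described as doing this implicitly through a regularizer, whereas this sketch builds an orthonormal basis with Gram-Schmidt and projects directly. The vectors and function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project(gradient, basis):
    """Component of `gradient` inside the subspace spanned by `basis`."""
    out = [0.0] * len(gradient)
    for b in basis:
        c = dot(gradient, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


# Hypothetical gradients from earlier tasks span a 2D subspace of R^3.
history = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(history)

new_grad = [0.5, -2.0, 3.0]
shared = project(new_grad, basis)                      # reusable component
residual = [g - s for g, s in zip(new_grad, shared)]   # task-specific component
print(shared, residual)
```

The split mirrors the two phases in the abstract at a conceptual level: the in-subspace component corresponds to knowledge that can be reused from earlier tasks, and the orthogonal residual corresponds to what a new task still has to fit.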

2605.19299 2026-05-20 cs.LG

Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications

Mahdi Naser Moghadasi

Affiliations: BrightMind AI Research

AI summary: This paper studies bidirectional knowledge distillation between Random Forests and Deep Neural Networks and proposes new methods. Across 144 experiments, bidirectional RF-DL distillation is competitive on classification and regression tasks while offering complementary benefits in interpretability and expressiveness.

Abstract

The exponential growth of big data has intensified the need for efficient and interpretable machine learning models that can handle diverse data characteristics while maintaining computational efficiency. Knowledge distillation has primarily focused on neural network-to-neural network transfer, leaving cross-paradigm knowledge transfer largely unexplored. This paper presents the first comprehensive study of bidirectional knowledge distillation between Random Forests (RF) and Deep Neural Networks (DNN), addressing critical gaps in ensemble learning and model compression for big data applications. We propose novel methodologies including progressive multi-stage distillation, multi-teacher ensemble distillation from diverse tree models, and uncertainty-aware cross-paradigm transfer mechanisms. Through 144 comprehensive experiments across 6 diverse datasets encompassing classification and regression tasks, we demonstrate that bidirectional RF-DL distillation achieves competitive performance while providing complementary benefits: interpretability from tree models and expressiveness from neural networks. Our results show that multi-teacher ensemble distillation consistently outperforms traditional approaches, with NN-COMPACT achieving 98.13% classification accuracy and NN-WIDE reaching 92.6% R^2 score in regression tasks. The proposed framework enables deployment flexibility in big data environments, allowing optimal model selection based on computational constraints and interpretability requirements. This work establishes a new research direction in cross-paradigm knowledge transfer with significant implications for interpretable AI and scalable model deployment in resource-constrained big data systems.
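
Multi-teacher ensemble distillation into a neural student can be sketched with the standard soft-target recipe: average the class probabilities from several teachers, then train the student against both those soft targets and the hard label. This is the generic distillation formulation, not the paper's specific method; the teacher probabilities, `temperature`, and `alpha` are made-up illustrative values.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_probs):
    """Average the class probabilities predicted by several teachers."""
    n = len(teacher_probs)
    return [sum(p[c] for p in teacher_probs) / n
            for c in range(len(teacher_probs[0]))]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, soft_targets, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term."""
    soft = kl_divergence(soft_targets, softmax(student_logits, temperature))
    hard = -math.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft + (1.0 - alpha) * hard


# Hypothetical class-probability estimates from three tree-based teachers.
teachers = [[0.8, 0.2, 0.0], [0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]
targets = ensemble_soft_targets(teachers)

loss_good = distillation_loss([2.0, 0.5, -1.0], targets, hard_label=0)
loss_bad = distillation_loss([-1.0, 0.5, 2.0], targets, hard_label=0)
print(targets, loss_good, loss_bad)
```

Tree ensembles have no logits, so the teacher side here is averaged probabilities rather than temperature-softened logits; only the student output is softened. The reverse direction (a network teaching a forest) would need a different mechanism, such as fitting trees to the network's predicted probabilities, and is not shown.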

2605.19289 2026-05-20 cs.CV

What Makes Synthetic Data Effective in Image Segmentation

Jinjin Zhang, Xiefan Guo, Yizhou Jin, Nan Zhou, Di Huang

Affiliations: State Key Laboratory of Complex and Critical Software Environment; Beihang University; School of Computer Science and Engineering

AI summary: This paper studies what makes synthetic data effective in image segmentation. Analyzing synthetic images from state-of-the-art diffusion models, it finds that dense scene composition and fine instance fidelity are the key factors, and proposes SENSE, a unified framework for improving segmentation performance.

Comments: Accepted to ICML 2026

Abstract

Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.

2605.19285 2026-05-20 cs.CL cs.AI cs.CY

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

理性是否必要且充分？为可解释的虚假信息检测调优大语言模型

Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； Tongyi Lab, Alibaba Group（阿里云实验室）； School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）

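AI summary: This paper studies how fine-tuning large language models (LLMs) can improve explainable misinformation detection. It proposes LONSREX, a new data synthesis pipeline that locates necessary and sufficient rationales, addressing the insufficient and redundant rationales that coarse-grained labels and over-verification behavior cause in existing methods.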

Comments: Accepted by KDD 2026. 12 pages, 8 figures. Code: https://github.com/wangbing1416/LONSREX


AI-generated abstract (translated from Chinese)

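The rapid spread of misinformation on social media has become a serious challenge. To curb its diffusion, Misinformation Detection (MD) has become a key research area. Traditional MD methods based on small models usually perform binary classification through a black-box process. In recent years, the rise of Large Language Models (LLMs) has made explainable MD possible: the model generates rationales that explain its decisions, which improves transparency. Existing explainable MD methods mainly focus on building elaborate prompts to extract rationales from off-the-shelf LLMs. In this paper, we propose a pipeline for fine-tuning an LLM dedicated to explainable MD. The pipeline first collects fact-checked articles at scale, then uses several strong LLMs to generate veracity predictions and rationales. To ensure high-quality training data, we apply a filtering strategy that selects only correct instances for fine-tuning. Although this pipeline is intuitive and common, our experiments show that simple filtering based only on label correctness is not enough in practice and has two key limitations: (1) coarse-grained labels lead to insufficient rationales: rationales filtered only by binary labels do not adequately support their decisions; (2) over-verification behavior leads to unnecessary rationales: stronger LLMs tend to over-verify, generating overly verbose and unnecessary rationales. To address these problems, we introduce LONSREX, a new data synthesis pipeline for locating necessary and sufficient rationales in explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, and thereby evaluates its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.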

English abstract

The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.
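The abstract describes label-correctness filtering and a per-step contribution metric but does not define either in detail. The following is a minimal sketch of the general idea, assuming a simple ablation scheme as a stand-in: it is not the LONSREX metric. `Candidate`, `Scorer`, `step_scores`, `prune_steps`, the `eps` threshold, and `toy_scorer` are all hypothetical names introduced for illustration. Here necessity is approximated by the score drop when one step is removed, and sufficiency by the score gain when that step is used alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    claim: str
    steps: List[str]      # verification steps that make up the rationale
    predicted_label: int  # veracity label predicted by the generating LLM
    gold_label: int       # label taken from the fact-checked article


# Hypothetical scorer: confidence in `label` given the claim and a subset of steps.
Scorer = Callable[[str, Sequence[str], int], float]


def filter_by_label(cands: Sequence[Candidate]) -> List[Candidate]:
    """Naive filtering: keep only instances whose predicted label is correct."""
    return [c for c in cands if c.predicted_label == c.gold_label]


def step_scores(c: Candidate, scorer: Scorer) -> List[Tuple[float, float]]:
    """Return (necessity, sufficiency) for each step via simple ablation."""
    full = scorer(c.claim, c.steps, c.gold_label)
    empty = scorer(c.claim, [], c.gold_label)
    scores = []
    for i, step in enumerate(c.steps):
        without = c.steps[:i] + c.steps[i + 1:]
        # Assumed proxy: how much the score drops when this step is removed.
        necessity = full - scorer(c.claim, without, c.gold_label)
        # Assumed proxy: how much this step alone improves on no rationale.
        sufficiency = scorer(c.claim, [step], c.gold_label) - empty
        scores.append((necessity, sufficiency))
    return scores


def prune_steps(c: Candidate, scorer: Scorer, eps: float = 0.05) -> List[str]:
    """Drop steps whose removal barely changes the score (assumed threshold)."""
    return [s for s, (nec, _) in zip(c.steps, step_scores(c, scorer)) if nec > eps]


def toy_scorer(claim: str, steps: Sequence[str], label: int) -> float:
    # Toy stand-in for a model call: only steps mentioning "source" help.
    return 0.9 if any("source" in s for s in steps) else 0.5


candidates = [
    Candidate("claim A", ["check the source date", "restate the claim"], 1, 1),
    Candidate("claim B", ["check the source"], 0, 1),
]
kept = filter_by_label(candidates)        # label filter removes "claim B"
pruned = prune_steps(kept[0], toy_scorer)  # ablation removes the redundant step
```

In this toy setup the label filter alone would keep the redundant "restate the claim" step, which is the kind of unnecessary rationale the abstract attributes to over-verification. In practice the scorer would be a model call, and the real metric may differ substantially from this ablation proxy.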



Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers:Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

The Evaluation Game: Beyond Static LLM Benchmarking

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

What Makes a Representation Good for Single-Cell Perturbation Prediction?

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

Agentic Trading: When LLM Agents Meet Financial Markets

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

An Exterior Method for Nonnegative Matrix Factorization

BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation

ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

An Objective Performance Evaluation of the LSTM Networks in Time Series Classification

MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

A Two-Phase Adaptive Balanced Penalty Method for Controllable Pareto Front Learning under Split Feasibility Conditions

MMGS: 10$\times$ Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications

What Makes Synthetic Data Effective in Image Segmentation

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection