arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10154 2026-05-12 cs.LG

Stable Long-Horizon PDE Forecasting via Latent Structured Spectral Propagators

Xiaoxiao Lu, Ye Yuan, Jiahao Shi

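Affiliations: School of AIA, Huazhong University of Science and Technology

AI summary: This paper studies stable long-horizon forecasting of partial differential equations (PDEs) and proposes a neural forecasting framework based on a latent Structured Spectral Propagator (SSP). By recasting PDE evolution as structured spectral propagation in a propagation-oriented latent space, the method separates dynamic evolution from spatial detail, improving forecast stability and accuracy. Experiments show that SSP outperforms existing methods on long-horizon forecasting, reducing relative $L_2$ error by up to 48.9% and improving stability in temporal extrapolation.

Abstract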

Long-horizon forecasting of time-dependent partial differential equations (PDEs) is critical for characterizing the sustained evolution of physical systems. While neural operators have emerged as efficient surrogates, they typically learn implicit finite-time transitions from discrete observations. When deployed autoregressively, such propagators often suffer from rapid error accumulation and dynamic drift. To address this, we propose a neural forecasting framework that reformulates PDE rollout as learning a Structured Spectral Propagator (SSP) in a propagation-oriented latent space. Following an analysis-propagation-synthesis design, our framework: (i) maps physical states into a shared, time-consistent spatial representation; (ii) projects this space into a compact propagation state to isolate recurrent dynamics from fine-grained spatial details, thereby decoupling reconstruction fidelity from rollout regularity; and (iii) evolves retained spectral modes using a frequency-conditioned linear backbone complemented by a nonlinear spectral closure to account for truncated interactions. This explicit structuring endows the propagator with a strong inductive bias for coherent modal evolution. Extensive experiments demonstrate that SSP significantly outperforms state-of-the-art baselines, reducing relative $L_2$ errors by up to 48.9% and exhibiting improved stability in temporal extrapolation beyond the supervised horizon.
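
Step (iii) of the abstract, evolving retained spectral modes with a frequency-conditioned linear backbone, can be illustrated with a hand-written toy. The sketch below is not the paper's SSP: it has no learned components, no latent space, and no nonlinear closure. It assumes a 1D periodic heat equation, for which the exact per-mode multiplier is known, and the names `spectral_step`, `rollout`, `nu`, and `n_modes` are hypothetical.

```python
import numpy as np

def spectral_step(u, dt, nu=0.1, n_modes=16, length=2 * np.pi):
    """Advance a periodic 1D field by one step using a per-frequency linear multiplier.

    Assumption: the dynamics are the heat equation u_t = nu * u_xx, so each
    Fourier mode k is scaled by exp(-nu * k**2 * dt). Modes at index >= n_modes
    are truncated (set to zero).
    """
    n = u.shape[-1]
    u_hat = np.fft.rfft(u)
    k = 2 * np.pi * np.fft.rfftfreq(n, d=length / n)  # angular wavenumbers
    multiplier = np.exp(-nu * k**2 * dt)
    multiplier[n_modes:] = 0.0  # keep only the retained low-frequency modes
    return np.fft.irfft(u_hat * multiplier, n=n)

def rollout(u0, steps, dt, **kwargs):
    """Autoregressive rollout: feed each prediction back in as the next input."""
    states = [np.asarray(u0, dtype=float)]
    for _ in range(steps):
        states.append(spectral_step(states[-1], dt, **kwargs))
    return np.stack(states)
```

Because every multiplier has magnitude at most 1 in this toy, repeated application cannot amplify any retained mode, which is one simple way an explicitly structured propagator can keep a rollout from drifting.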

2605.10153 2026-05-12 cs.SD cs.LG

APEX: Audio Prototype EXplanations for Classification Tasks

Piotr Kawa, Kornel Howil, Piotr Borycki, Miłosz Adamczyk, Przemysław Spurek, Piotr Syga

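Affiliations: Department of Artificial Intelligence, Wroclaw University of Science and Technology, Poland; Resemble AI, USA; IDEAS Research Institute, Poland; Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Doctoral School of Exact and Natural Sciences, Jagiellonian University, Poland

AI summary: This paper presents APEX, an explanation framework for audio classification that addresses the lack of mature explainable-AI methods in the audio domain. It works post hoc on pre-trained audio classifiers, needs no fine-tuning, and leaves the original model's outputs unchanged. APEX splits explanations into four perspectives (Square-based, Time-based, Frequency-based, and Time-Frequency-based prototypes), giving example-based explanations that fit acoustic signals better and are easier to interpret.

Abstract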

Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.

2605.10151 2026-05-12 cs.LG cs.SY eess.SY math.OC

Learning to Sparsify Stochastic Linear Bandits

Zhengmiao Wang, Ming Chi, Zhi-Wei Liu, Lintao Ye, Carla Fabiana Chiasserini

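Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Department of Electronics and Telecommunications, Politecnico di Torino

AI summary: This paper studies stochastic linear bandits with a sparsity constraint in high-dimensional spaces, where the goal is to select sparse actions while minimizing cumulative regret. The authors propose an adaptively phased exploration-and-exploitation framework that uses ordinary least squares for parameter learning and specialized subroutines for sparse action selection. When the action set is a Euclidean ball, the optimal sparse action can be computed efficiently and the algorithm attains $\tilde{\mathcal{O}}(d\sqrt{T})$ regret; for general convex and compact action sets, a greedy subroutine is used, with $α$-regret bounds of $\tilde{\mathcal{O}}(d\sqrt{T})$ under strong convexity and $\tilde{\mathcal{O}}(dT^{2/3})$ without it. Experiments, including a recommendation-system application, validate the algorithms.

Comments Include all the omitted details and proofs from the conference paper accepted to IJCAI 2026

Abstract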

This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in minimizing cumulative regret while tackling the potential NP-hardness of finding optimal sparse actions due to the inherent combinatorial structure of the problem. We propose an adaptively phased exploration and exploitation algorithmic framework, utilizing ordinary least squares for parameter learning and specialized subroutines for sparse action selection. When the action set is a Euclidean ball, optimal sparse actions can be efficiently computed, enabling us to establish a $\tilde{\mathcal{O}}(d\sqrt{T})$ regret, where $d$ is the dimension of the action vector and $T$ is the time horizon length. For general convex and compact action sets where finding optimal sparse actions is intractable, we employ a greedy subroutine. For general strongly convex action sets, we derive a $\tilde{\mathcal{O}}(d \sqrt{T})$ $α$-regret; for general compact sets lacking strong convexity, we establish a $\tilde{\mathcal{O}}(d T^{2/3})$ $α$-regret, where $α$ pertains to the approximation ratio of the greedy algorithm. Finally, we validate the performance of our algorithms using extensive experiments, including an application to a recommendation system.
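
To see why the Euclidean-ball case is tractable, note that maximizing the linear reward $\langle \hat\theta, a \rangle$ over actions with at most $s$ nonzero entries and norm at most the ball radius has a closed form: keep the $s$ largest-magnitude coordinates of $\hat\theta$ and rescale to the boundary. The sketch below shows that standard computation; it is not taken from the paper, whose subroutine may differ, and `best_sparse_action`, `theta_hat`, `s`, and `radius` are hypothetical names.

```python
import numpy as np

def best_sparse_action(theta_hat, s, radius=1.0):
    """Maximize <theta_hat, a> subject to ||a||_0 <= s and ||a||_2 <= radius.

    The maximizer keeps the s entries of theta_hat with the largest absolute
    value and rescales them to the boundary of the ball.
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    support = np.argsort(-np.abs(theta_hat))[:s]  # indices of the top-s magnitudes
    action = np.zeros_like(theta_hat)
    action[support] = theta_hat[support]
    norm = np.linalg.norm(action)
    if norm > 0:
        action *= radius / norm  # move to the ball boundary
    return action
```

The cost is dominated by one sort over the $d$ coordinates, in contrast to the general compact case, where the abstract notes that finding the optimal sparse action is intractable and a greedy subroutine is used instead.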

2605.10149 2026-05-12 cs.CV

Improving Temporal Action Segmentation via Constraint-Aware Decoding

Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando

Affiliations: Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; Indian Institute of Technology Kharagpur, India; College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary: This paper studies how structural priors can improve temporal action segmentation. The authors propose a lightweight constraint-aware decoding framework that integrates statistical structural priors (transition confidence, action boundary sets, and per-class duration) to refine predictions at inference time without added model complexity. The method improves both fully supervised and semi-supervised segmentation models.

Comments accepted to ICPR 2026

Abstract

Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing, which limits scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors, such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD
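
As a rough illustration of inference-time refinement with Viterbi decoding, the sketch below decodes frame-wise class log-probabilities under a hard mask of allowed action transitions. It covers only a transition constraint; the boundary-set and duration priors mentioned in the abstract are not modeled, and this is not the authors' modified algorithm (see their repository for that). The function name `constrained_viterbi` and the boolean `allowed` matrix are assumptions made for the example.

```python
import numpy as np

def constrained_viterbi(log_probs, allowed):
    """Return the most likely label path that uses only allowed transitions.

    log_probs: (T, C) array of frame-wise class log-probabilities.
    allowed:   (C, C) boolean array; allowed[i, j] is True if class i may be
               followed by class j (the diagonal should be True so that a
               segment can span several frames).
    """
    num_frames, num_classes = log_probs.shape
    trans = np.where(allowed, 0.0, -np.inf)  # forbidden moves get -inf score
    score = log_probs[0].copy()
    back = np.zeros((num_frames, num_classes), dtype=int)
    for t in range(1, num_frames):
        cand = score[:, None] + trans  # cand[i, j]: best path ending in i, then i -> j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(num_classes)] + log_probs[t]
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):  # follow back-pointers to frame 0
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Decoding over whole paths lets a single frame whose top class would violate the transition structure be overruled by its neighbors, without retraining the underlying segmentation model.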

2605.10148 2026-05-12 cs.CV

MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

Affiliations: Department of Electro-Optics, National Formosa University; Department of Electrical Engineering, National Taipei University; College of Artificial Intelligence and Green Energy, National Yang Ming Chiao Tung University

AI summary: This paper presents MicroViTv2, a lightweight Vision Transformer aimed at energy efficiency on edge devices. It uses a reparameterized design, namely Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW), together with Single Depth-Wise Transposed Attention (SDTA), to raise accuracy while keeping inference fast. Experiments show high energy efficiency on Jetson AGX Orin, supporting the case for evaluating efficiency beyond FLOPs.

Abstract

The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model adopts a reparameterized design, specifically Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces the Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at https://github.com/novendrastywn/MicroViT.
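
Structural re-parameterization in general relies on convolution being linear: parallel branches used during training can be folded into one kernel for inference. The toy below shows the simplest case, a 1D convolution plus an identity (skip) branch merged into a single kernel. It is a generic illustration, not the RepEmbed or RepDW design from the paper, and `fuse_branches`, `two_branch`, and `single_branch` are hypothetical names.

```python
import numpy as np

def fuse_branches(kernel):
    """Fold an identity (skip) branch into an odd-length 1D convolution kernel."""
    kernel = np.asarray(kernel, dtype=float)
    if kernel.size % 2 == 0:
        raise ValueError("kernel length must be odd so it has a center tap")
    fused = kernel.copy()
    fused[kernel.size // 2] += 1.0  # identity equals a unit impulse at the center
    return fused

def two_branch(x, kernel):
    """Training-time form: convolution branch plus skip connection."""
    return np.convolve(x, kernel, mode="same") + x

def single_branch(x, kernel):
    """Inference-time form: one convolution with the fused kernel."""
    return np.convolve(x, fuse_branches(kernel), mode="same")
```

Both forms compute the same output, but the fused form performs one convolution and no extra addition, which is the kind of saving that shows up in measured latency and energy rather than in a FLOP count.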

2605.10146 2026-05-12 cs.AI cs.CR

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu, Siyuan Li, Yuliang Chen

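Affiliations: School of Computer Science

AI summary: This paper studies safety risks in knowledge-intensive reasoning under malicious knowledge editing. To fill the gap left by existing benchmarks, which focus on editing efficacy rather than safety, the authors present EditRisk-Bench, which combines diverse malicious scenarios with multi-level reasoning tasks to evaluate how injected knowledge affects reasoning behavior and reliability. Experiments show that malicious edits can induce incorrect or unsafe reasoning while largely preserving general capabilities, which makes these risks hard to detect.

Abstract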

Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.

2605.10142 2026-05-12 cs.CV cs.AI

Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

Mateusz Cedro, Marcin Chlebus

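Affiliations: University of Warsaw

AI summary: This paper asks whether scaling vision models improves the quality of localisation-based explanations. Evaluating 11 ResNet, DenseNet, and Vision Transformer models of increasing depth and complexity on three image datasets with five post-hoc explanation methods, it finds that larger models do not improve explanation quality in most comparisons, and smaller models often match or exceed them. Pretraining improves predictive performance but does not consistently improve localisation scores, so explainability should be assessed explicitly during model selection for safety-sensitive deployments.

Comments 28 pages, 8 figures, 8 tables

Abstract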

Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
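
For readers unfamiliar with the first localisation metric, the sketch below computes Relevance Rank Accuracy as it is commonly described following Arras et al. (2022): with K the number of pixels in the ground-truth mask, take the K highest-attribution pixels and report the fraction that fall inside the mask. This is a minimal reading of that definition, not the authors' evaluation code, and the proposed Dual-Polarity Precision is not implemented here because the abstract does not give its exact formula.

```python
import numpy as np

def relevance_rank_accuracy(attribution, mask):
    """Fraction of the K highest-attribution pixels that lie inside the mask,
    where K is the number of mask pixels."""
    attribution = np.asarray(attribution, dtype=float).ravel()
    mask = np.asarray(mask, dtype=bool).ravel()
    k = int(mask.sum())
    if k == 0:
        raise ValueError("mask must contain at least one pixel")
    top_k = np.argsort(-attribution)[:k]  # indices of the K largest attributions
    return float(mask[top_k].mean())
```

A score of 1.0 means the strongest attributions coincide with the annotated object; a score near zero alongside high accuracy is the kind of mismatch the abstract warns about.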

2605.10141 2026-05-12 cs.AI

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Zeynel A. Uluşan, Burak S. Akbudak, Can S. Erer, Gözde Gül Şahin

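Affiliations: Koç University, Department of Computer Science and Engineering; Codeway Studios; Boğaziçi University, Department of Computer Engineering; Friedrich-Alexander-Universität Erlangen-Nürnberg, Intelligent Language Systems

AI summary: This paper introduces FormalRewardBench, a benchmark for evaluating reward models in formal theorem proving. Motivated by the sparse credit assignment of verifiable rewards in current neural theorem provers, it pairs correct proofs with incorrect variants produced by five expert-curated error injection strategies, giving 250 preference pairs. Frontier LLMs score highest (59.8%) and specialized theorem provers lowest (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation.

Abstract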

Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert-curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.

2605.10136 2026-05-12 cs.LG

Per-Loss Adapters for Gradient Conflict in Physics-Informed Neural Networks

Bum Jun Kim, Gnankan Landry Regis N'guessan

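Affiliations: The University of Tokyo, Japan; Axiom Research Group; Department of Applied Mathematics and Computational Science, NM-AIST, Tanzania; African Institute for Mathematical Sciences (AIMS), Research and Innovation Centre, Rwanda

AI summary: Physics-informed neural networks (PINNs) train a single neural approximation by minimizing several physics- and data-derived losses, whose gradients often conflict and can stall optimization. This paper argues that gradient conflict is not a single failure mode but comes in distinct regimes that call for different interventions. It proposes a diagnostic-first framework that, when warranted, attaches one low-rank adapter per loss to a shared trunk, giving each loss its own parameter subspace and a direct gradient pathway. Across more than 60 PDE configurations, adapters combined with reweighting yield significant improvements where persistent directional conflict dominates.

Comments 49 pages, 10 figures

Abstract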

Physics-informed neural networks (PINNs) train a single neural approximation by minimizing multiple physics- and data-derived losses, but the gradients of these losses often interfere and can stall optimization. Existing remedies typically treat this pathology either through scalar loss balancing or full-parameter-space gradient surgery, leaving it unclear which intervention is most appropriate. We show that PINN gradient conflict is not a uniform failure mode with one universal remedy. Instead, we identify distinct PINN gradient-conflict regimes, each associated with a different intervention class. Persistent directional conflict may require separate loss-indexed parameter subspaces, magnitude imbalance often favors scalar reweighting, and low or transient conflict may require no extra mitigation. To select between scalar reweighting and a lightweight architectural intervention, we propose a diagnostic-first framework. It profiles a 1000-step unmodified PINN run and, when intervention is warranted, uses one low-rank adapter per loss to create explicit loss-indexed parameter subspaces attached to a shared PINN trunk, providing each loss with a direct gradient pathway. Across more than 60 PDE configurations, including forward, inverse, multi-physics, parameter-varying, and high-dimensional problems up to 50D, persistent directional conflict dominates standard forward $K=3$ benchmarks and a natural $K=4$ thermoelastic system, where adapters combined with reweighting yield significant improvements. In contrast, $K=3$ inverse problems and natural $K=5$ and $K=6$ multi-physics systems are largely magnitude-dominated and often favor reweighting alone, while full-parameter-space gradient surgery can fail on heterogeneous parameter spaces.
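
The distinction between directional conflict and magnitude imbalance can be made concrete with two simple statistics on the per-loss gradients: pairwise cosine similarity and gradient norms. The sketch below computes both. It is only an assumed proxy for the kind of profiling the abstract describes; the paper's actual diagnostic, thresholds, and 1000-step protocol are not given here, and `conflict_profile` is a hypothetical name.

```python
import numpy as np

def conflict_profile(grads, eps=1e-12):
    """Return pairwise cosine similarities and norms of per-loss gradients.

    grads: sequence of K arrays, one gradient per loss, all taken with respect
           to the same shared parameters.
    Negative cosines suggest directional conflict; widely differing norms
    suggest magnitude imbalance.
    """
    g = np.stack([np.ravel(np.asarray(x, dtype=float)) for x in grads])
    norms = np.linalg.norm(g, axis=1)
    cosine = (g @ g.T) / np.maximum(np.outer(norms, norms), eps)
    return cosine, norms
```

Under this reading, a pair of losses with cosine near -1 at most steps would point toward separate parameter subspaces, while near-orthogonal gradients with norms differing by orders of magnitude would point toward scalar reweighting.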

2605.10130 2026-05-12 cs.CV

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Yasiru Ranasinghe, Elim Schenck, Florence Yellin, Shuowen Hu, Christopher Funk, Vishal M. Patel

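Affiliations: Johns Hopkins University; Kitware; DEVCOM Army Research Laboratory

AI summary: Existing open-vocabulary detectors target RGB images and generalize poorly to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. This paper presents Thermal-Det, the first open-vocabulary thermal object detector supervised by a large language model (LLM). It builds a synthetic dataset of over one million thermally aligned samples and combines cross-modal distillation with a text-calibration module to transfer detection knowledge without manual annotation. On public benchmarks it gains 2-4% AP over existing open-vocabulary detectors.

Comments Accepted at CVPR 26

Abstract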

Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.

2605.10129 2026-05-12 cs.CL

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

Xu Guo, Runyu Peng, Jian Tong, Yunhua Zhou, Haijun Lv, Zhihui Lu, Qipeng Guo

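Affiliations: Shanghai AI Laboratory; Shanghai Innovation Institute; Fudan University

AI summary: This paper studies whether a lightweight pre-pre-training (PPT) stage makes large language models more robust to noisy pre-training data. The authors run PPT on synthetic data with learnable temporal structure so that the model better resists noise during the subsequent pre-training stage. Experiments show consistent robustness gains across noise levels and a reduced need for natural-text pre-training tokens.

Abstract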

Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.

2605.10122 2026-05-12 cs.AI cs.LG

Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver

Canhong Yu, Changliang Zhou, Rongsheng Chen, Zhenkun Wang, Yu Zhou

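Affiliations: College of Computer Science and Software Engineering, Shenzhen University; School of Automation and Intelligent Manufacturing, Southern University of Science and Technology; Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, Southern University of Science and Technology; Pengcheng Laboratory

AI summary: Targeting the weakness of neural routing solvers on vehicle routing problems (VRPs) with complex constraints, this paper revisits how state embeddings are generated and finds that current mechanisms restrict the observation space during decoding, which becomes a performance bottleneck. The authors propose the Constraint-Aware Residual Modulation (CARM) module, which adaptively modulates the context embedding with constraint-relevant variables to strengthen constraint awareness. Experiments show that CARM consistently improves two single-task and five multi-task routing solvers, especially when scaling to large instances and generalizing to unseen VRP variants.

Abstract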

Heavy-Encoder-Light-Decoder (HELD) neural routing solvers have emerged as a promising paradigm due to their broad applicability across multiple vehicle routing problems (VRPs). However, they typically struggle with VRP variants with complex constraints. To address this limitation, this paper systematically revisits existing neural solvers from the perspective of the generation mechanism for state embeddings (i.e., query vector prior to compatibility calculation) during decoding. We identify that current mechanisms restrict the observation space during attention computation, introducing a key bottleneck to achieving high-quality solutions. Through detailed empirical analysis, we demonstrate the necessity of preserving a global observation space. To overcome the constraint-agnostic drawback inherent to global observation spaces, we propose a simple yet powerful Constraint-Aware Residual Modulation (CARM) module. By adaptively modulating the context embedding with constraint-relevant variables, CARM effectively enhances constraint awareness, enabling the neural solver to fully leverage the global observation space and generate an efficient state embedding. Extensive experimental results across two single-task and five multi-task neural routing solvers confirm that the CARM module consistently boosts baseline performance. Notably, solvers equipped with our CARM achieve substantial improvements in scaling to large-scale instances and in generalizing to unseen VRP variants. These findings provide valuable insights for the architectural design of neural routing solvers.

2605.10121 2026-05-12 cs.LG cs.AI cs.HC

Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces

Christian Oliva, Vinicio Changoluisa, Francisco B Rodríguez, Luis F Lago-Fernández

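Affiliations: Grupo de Neurocomputación Biológica, Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid; Grupo de Investigación en Electrónica y Telemática, Universidad Politécnica Salesiana

AI summary: This paper studies how to make recurrent neural networks more explainable in brain-computer interfaces based on P300 event-related potentials. The authors propose the Post-Recurrent Module (PRM), an additional layer integrated into an RNN architecture to improve both performance and transparency. Using global and local explainability techniques, the approach supports a dual analysis of spatio-temporal signals, identifying the brain regions and time intervals most relevant to classification in line with established neurophysiological descriptions. Experiments show a 9% improvement over the state of the art and highlight the importance of inter- and intra-subject variability.

Abstract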

Brain-Computer Interfaces (BCIs) based on P300 event-related potentials offer promising applications in health, education, and assistive technologies. However, challenges related to inter- and intra-subject variability and the explainability of Deep Learning (DL) models limit their practical deployment. In this work, we present the Post-Recurrent Module (PRM), an additional layer designed to improve both performance and transparency, incorporated into a Recurrent Neural Network (RNN) architecture for classifying P300 signals from EEG data. Our approach enables a dual analysis of spatio-temporal signals through both global and local explainability techniques, allowing us not only to identify the most relevant brain regions and critical time intervals involved in classification, but also to interpret model decisions in terms of spatio-temporal EEG patterns consistent with well-established neurophysiological descriptions of the P300. Experimental results show a 9% improvement in performance over the state of the art, while also revealing the importance of inter- and intra-subject variability, in alignment with established neuroscience literature. By making model decisions transparent and efficient, we present a framework for explainable EEG-based models. This framework is not limited to more efficient P300 detection, but can be generalized to a wide range of EEG-based tasks. Its ability to identify key spatial and temporal features makes it suitable for applications such as motor imagery, steady-state visual evoked potentials, and even cognitive workload assessment.

2605.10120 2026-05-12 cs.CV cs.AI

MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

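Affiliations: Shanghai Key Laboratory of Intelligent Information Processing; School of Computer Science, Fudan University

AI summary: This paper presents MicroWorld, a framework that addresses the weak performance of multimodal large language models in specialized domains such as microscopy. It builds a multimodal attributed property graph (MAPG) and uses it to augment reasoning at inference time without domain-specific fine-tuning. Experiments show that MicroWorld improves Qwen3-VL-8B-Instruct by 37.5% on MicroVQA, setting a new state of the art, and generalizes across domains.

Comments 29 pages, 14 figures

Abstract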

Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.

2605.10118 2026-05-12 cs.RO

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

Zhixuan Shen, Jiawei Du, Ziyu Guo, Han Luo, Lilan Peng, Joey Tianyi Zhou, Haonan Luo, Tianrui Li

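Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University, China; Centre for Frontier AI Research, A*STAR, Singapore; School of Computer Science, University of Leeds, UK

AI summary: This work addresses the limited performance of vision-language models in embodied navigation caused by scarce real-world data, proposing SAGE, a framework built on physics-grounded semantic abstraction. SAGE constructs semantic environments, trains with reinforcement learning, and transfers the abstract policy to real-world control, so that agents learn and plan in a simplified physics abstraction. It reaches a 53.21% LLM-Match Success Rate on A-EQA (+9.7% over baseline) and shows encouraging transfer to physical indoor robot deployment.

Comments 28 pages, 15 figures, Extended Version of accepted ICML 2026 Paper

Abstract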

Vision-Language Models (VLMs) have demonstrated exceptional general reasoning capabilities. However, their performance in embodied navigation remains hindered by a scarcity of aligned open-world vision and robot control data. Despite simulators providing a cost-effective alternative for data collection, the inherent reliance on photorealistic simulations often limits the transferability of learned policies. To this end, we propose Sandbox-Abstracted Grounded Experience (SAGE), a framework that enables agents to learn within a physics-grounded semantic abstraction rather than a photorealistic simulation, mimicking the human capacity for mental simulation where plans are rehearsed in simplified physics abstractions before execution. The SAGE system operates via three synergistic phases: (1) Genesis: constructing diverse, physics-constrained semantic environments to bootstrap experience; (2) Evolution: distilling experiences through Reinforcement Learning (RL), utilizing a novel asymmetric adaptive clipping mechanism to stabilize updates; (3) Navigation: bridging the abstract policy to open-world control. We demonstrate that SAGE significantly improves planner-assisted embodied navigation, achieving a 53.21% LLM-Match Success Rate on A-EQA (+9.7% over baseline), while showing encouraging transfer to physical indoor robot deployment.

2605.10117 2026-05-12 cs.CV cs.AI

Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

Donghyun Kim, Jaehyoung Park

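Affiliations: Stony Brook University

AI summary: This paper studies how to adapt perception compute to scene complexity in autonomous driving. It proposes Enhanced HOPE, an adaptive perception architecture that estimates the geometric complexity of each LiDAR frame with an unsupervised estimator and routes the frame through a shallow or deep processing path accordingly. The method also introduces a linear-time subspace-based interaction network and a persistent temporal memory module, which improves tracking of occluded objects. On nuScenes and CARLA it cuts latency by 38% on simple scenes with no accuracy loss and improves mean Average Precision by 2.7 points on rare long-tail scenarios.

Abstract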
Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.

2605.10115 2026-05-12 cs.LG cond-mat.mtrl-sci

Generating Symmetric Materials using Latent Flow Matching

Anmar Karmush, Cedric Mathieu Brandenburg, Soheil Ershadrad, Johanna Rosén, Michael Felsberg, Filip Ekström Kelvinius

Affiliations * Department of Electrical Engineering (ISY) & AI4x, Linköping University; Department of Physics, Chemistry and Biology (IFM), Linköping University; Wallenberg Initiative Materials Science for Sustainability (WISE), Linköping University; Department of Computer and Information Science (IDA), Linköping University

AI summary This paper proposes SymADiT, a symmetry-aware generative model for materials that aims to improve on the existing All-atom Diffusion Transformer (ADiT). The method represents materials using Wyckoff positions and performs generative modelling in latent space; by forcing the output to satisfy the symmetry constraints of the crystal's space group and each atom's Wyckoff position, it generates materials with more realistic symmetry properties. Experiments show that SymADiT is competitive with existing models at generating stable, symmetric materials.

Comments Preprint

Abstract

Tackling the task of materials generation, we aim to enhance the previously proposed All-atom Diffusion Transformer (ADiT) by introducing SymADiT, a symmetry-aware variant. To do so, we use a representation of materials based on Wyckoff positions. We follow ADiT and perform generative modelling in latent space, adapted to our symmetry-aware representation. By forcing the output of the generative model to adhere to the symmetry restrictions imposed by the generated crystal's space group and each atom's Wyckoff position, the generated materials exhibit more realistic symmetry properties. We benchmark our method against both symmetry-aware and symmetry-agnostic models for materials generation and show competitive performance, generating stable, symmetric materials with a simple Transformer architecture.

2605.10114 2026-05-12 cs.CL

SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

Xiangcheng Meng, Shu Wang, Yixiang Fang

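Affiliations * The Chinese University of Hong Kong, Shenzhen

AI summary SkillRAE is a skill-based context compilation method that aims to improve Retrieval-Augmented Execution (RAE) on complex tasks. It has an offline and an online stage: the offline stage builds a multi-level skill graph to capture relationships among skills, and the online stage uses skill-ranked retrieval and key-evidence compilation to produce a task context that is compact, grounded, and immediately usable. Experiments show that SkillRAE significantly outperforms existing methods on public benchmarks, demonstrating the effectiveness and importance of context compilation.

Abstract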

Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have studied Retrieval-Augmented Execution (RAE), which typically first retrieves external skills and other knowledge, then compiles a context from the retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and they pay little attention to how to organize the selected skill evidence in a form that is compact, grounded, and immediately usable for the downstream executors. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, which consists of an offline stage and an online stage. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits to capture their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, rather than a mere prompt addition.

2605.07846 2026-05-12 cs.CV

BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

Peilin Xiong, Honghui Yuan, Junwen Chen, Keiji Yanai

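Affiliations * Department of Informatics, The University of Electro-Communications

AI summary This paper studies boundary distortion in coarse-mask local image editing caused by mask-shape bias and proposes a method called BRIDGE. The method keeps masks outside the DiT backbone and introduces a learnable Discrete Geometric Gate, meeting the dual constraint of a stable background and a flexibly generated editable region. Experiments show that BRIDGE markedly improves editing quality on multiple benchmarks while keeping the added module lightweight.

Comments 11 pages, 6 figures

Abstract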

Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.

2605.07820 2026-05-12 cs.LG

Scaling Categorical Flow Maps

Oscar Davis, Anastasiia Filippova, Pierre Ablin, Victor Turrisi, Amitis Shidani, Marco Cuturi, Louis Béthune

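Affiliations * Apple, University of Oxford

AI summary This paper studies how to scale Categorical Flow Maps (CFMs) to large-scale language modelling. The authors train a 1.7B-parameter flow model and self-distill it into a CFM that generates high-quality text in as few as 4 steps. The approach maintains near-data-level token entropy while achieving results in the same range as discrete diffusion models. The authors also introduce a likelihood bound for the semi-discrete setting and discuss the challenges that arise in large-scale training, along with guidance on loss weighting and time scheduling.

Comments Minor style changes

Abstract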

Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
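The flow matching process between a Gaussian and one-hot encoded data mentioned in the abstract can be illustrated with a minimal sketch. It assumes a straight-line interpolation path, which is a common choice but is not confirmed by the abstract; the function name `flow_matching_pair` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def flow_matching_pair(token_id, vocab_size, t, rng):
    """Return (x_t, target_velocity) for one token on a linear path.

    Assumption: straight-line interpolation between noise and data.
    """
    # Endpoint at t=1: the one-hot encoding of the data token.
    x1 = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    # Endpoint at t=0: a standard Gaussian sample of the same dimension.
    x0 = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    # Linear interpolant and its (constant) time derivative.
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


if __name__ == "__main__":
    x_t, v = flow_matching_pair(token_id=2, vocab_size=5, t=0.25,
                                rng=random.Random(0))
    print(x_t)
    print(v)
```

A network trained to regress `velocity` from `(x_t, t)` would define the base flow; how the paper distills that flow into a few-step map is not covered by this sketch.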

2605.07786 2026-05-12 cs.CV cs.AI

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

Caterina Gallegati, Monica Bianchini, Franco Scarselli, Vittorio Murino, Barbara Toniella Corradini

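Affiliations * University of Siena; AI for Good (AIGO), Istituto Italiano di Tecnologia; University of Verona

AI summary As generative models make breakthroughs in visual quality, traditional feature-distribution metrics such as FID are still treated as the gold standard, yet they are limited by outdated features and parametric assumptions. To address this, the paper proposes APEX, an assumption-free embedding evaluation framework based on the Sliced Wasserstein Distance that does not rely on a specific parametric form and works with different embedding models, such as CLIP and DINOv2. Experiments show that APEX scales well to high-dimensional spaces, is more robust to visual degradations, and is highly stable in cross-dataset evaluation.

Abstract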

As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
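For readers unfamiliar with the Sliced Wasserstein Distance named in the abstract, the sketch below shows the generic Monte Carlo estimator: project both sample sets onto random directions, sort the one-dimensional projections, and average the squared gaps. It is not the APEX implementation; the function name, the equal-sample-size restriction, and the default of 64 projections are assumptions made for illustration.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    xs, ys: equally sized lists of d-dimensional points (lists of floats).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized sample sets")
    dim = len(xs[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project to 1-D, where optimal transport reduces to sorting.
        px = sorted(sum(a * b for a, b in zip(p, direction)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, direction)) for p in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / n_projections)
```

In an image-quality setting, `xs` and `ys` would be embedding vectors of real and generated images from a feature extractor; no Gaussian fit of the features is involved, which is the sense in which the measure avoids a parametric form.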

2605.07575 2026-05-12 cs.CV cs.AI

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu, Qingfeng He, Ziheng Wang, Xu Wang, Qifeng Chen, Zhiwen Yu, Yunhao Liu

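Affiliations * Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology; Harbin Engineering University

AI summary This paper proposes Response-G1, a framework that addresses the problem of deciding when to respond proactively in streaming video understanding. Through explicit scene graph modeling, it establishes a structured alignment between video content and the query's response conditions, improving the accuracy and interpretability of response decisions. The framework consists of three fine-tuning-free stages: online scene graph generation, memory-based semantic retrieval, and retrieval-augmented trigger prompting. Experiments show that it outperforms existing methods on both proactive and reactive tasks.

Comments Accepted to ACL 2026

Abstract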

Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.

2605.07574 2026-05-12 cs.CV

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma

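Affiliations * Beijing University of Posts and Telecommunications, China; National Institute of Informatics, Japan; Peking University, China; The University of Tokyo, Japan

AI summary Mainstream vision-language models (VLMs) rely on standard RGB inputs and therefore struggle with optically ambiguous scenes such as reflections and transparent objects. To address this, the paper proposes PolarVLM, the first multimodal framework to integrate polarimetric physical parameters into VLMs; a dual-stream architecture and a progressive training strategy prevent physical misinterpretations while preserving general visual abilities. The work also builds PolarVQA, the first benchmark for polarization-aware visual question answering. Experiments show that PolarVLM significantly outperforms the RGB baseline across multiple tasks, with especially large gains in reflection recognition and glass counting.

Comments 23 pages, 12 figures, including appendices

Abstract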

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.

2605.07429 2026-05-12 cs.CV

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

Linxiao Shi, Siming Zheng, Zerong Wang, Hao Zhang, Jinwei Chen, Bo Li, Shifeng Chen, Peng-Tao Jiang

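Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd.; Shenzhen University of Advanced Technology

AI summary Because of constraints in their optical design, existing mobile devices struggle to produce natural optical bokeh. To address this, the paper proposes MagicBokeh, a unified diffusion-based framework that efficiently generates high-quality, photorealistic bokeh. Using an alternative training strategy and a focus-aware masked attention mechanism, the method jointly optimizes bokeh rendering and super-resolution, markedly improving controllability and visual fidelity, and it introduces a degradation-aware depth module to improve depth estimation on low-quality inputs. Experiments show that MagicBokeh efficiently produces highly realistic bokeh on real-world low-resolution images, offering a new direction for future bokeh rendering research.

Comments Accepted by CVPR 2026

Abstract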

Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at https://github.com/vivoCameraResearch/MagicBokeh.

2605.07384 2026-05-12 cs.LG

StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models

Panqi Chen, Yifan Sun, Shikai Fang, Xiao Fu, Lei Cheng

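Affiliations * College of Information Science and Electronic Engineering, Zhejiang University; School of EECS, Oregon State University

AI summary StreamPhy is an end-to-end framework for real-time inference of high-dimensional physical field dynamics from irregular sparse measurements. It combines an adaptive observation encoder, a structured state-space model, and an FT-FiLM decoder, enabling memory-efficient online updates across irregular time intervals and accurate field generation. The authors prove that FT-FiLM is more expressive than the functional Tucker model, and experiments on several physical systems show higher accuracy and faster inference than existing methods.

Abstract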

Inferring the evolution of high-dimensional and multi-modal (e.g., spatio-temporal) physical fields from irregular sparse measurements in real time is a fundamental challenge in science and engineering. Existing approaches, including diffusion-based generative models and functional tensor methods, typically operate in offline settings, depend on full temporal observations, or incur substantial inference cost. We propose StreamPhy, an end-to-end framework that enables efficient and accurate streaming inference of full-field physical dynamics from incoming irregular sparse measurements. The framework integrates a data-adaptive observation encoder that is robust to arbitrary observation patterns, a structured state-space model that supports memory-efficient online updates across irregular time intervals, and an expressive Functional Tensor Feature-wise Linear Modulation (FT-FiLM) decoder for continuous-field generation. We prove that FT-FiLM is more expressive than the functional Tucker model, admitting a richer function class for handling complex dynamics. Experiments on three representative physical systems under challenging sampling patterns show that StreamPhy consistently outperforms state-of-the-art baselines, with at least 48% improvement in accuracy and up to 20-100x faster inference than diffusion-based methods.
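To illustrate why a state-space model can handle irregular time intervals with constant memory, the sketch below advances a diagonal linear system by an arbitrary step using the exact zero-order-hold discretization. This is a generic textbook construction, not StreamPhy's model; the function `ssm_step` and the parameters `lam` and `b` are hypothetical.

```python
import cmath


def ssm_step(state, u, dt, lam, b):
    """Advance a diagonal linear SSM  x' = lam * x + b * u  by a step dt.

    The input u is held constant over the step (zero-order hold);
    lam and b hold one value per mode, and every lam must be nonzero.
    """
    new_state = []
    for x, l, bi in zip(state, lam, b):
        decay = cmath.exp(l * dt)
        # Exact integral of the input term over the step.
        drive = (decay - 1.0) / l * bi * u
        new_state.append(decay * x + drive)
    return new_state


if __name__ == "__main__":
    lam = [complex(-0.5, 2.0), complex(-1.0, 0.0)]
    b = [1.0, 0.5]
    state = [complex(1.0, 0.0), complex(0.0, 1.0)]
    # Measurements arrive at uneven gaps; only the current state is stored.
    for dt, u in [(0.1, 0.3), (0.37, 0.3), (0.02, -1.0)]:
        state = ssm_step(state, u, dt, lam, b)
    print(state)
```

Because the discretization is exact for a held input, two consecutive steps of sizes 0.1 and 0.37 give the same state as a single step of size 0.47, so uneven sampling does not introduce additional discretization error in this simplified setting.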

2605.07177 2026-05-12 cs.LG cs.AI

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu

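Affiliations * Xiaohongshu Inc.; University of Cambridge

AI summary Existing multimodal search agents typically process target entities sequentially, producing redundant interaction rounds when a query decomposes into several independent retrieval tasks. This paper proposes HyperEyes, a parallel multimodal search agent trained with dual-grained efficiency-aware reinforcement learning; it fuses visual grounding and retrieval into a single atomic action to search for multiple entities concurrently and treats inference efficiency as a core training objective. HyperEyes uses a two-stage training strategy that combines a Parallel-Amenable Data Synthesis Pipeline with the dual-grained reinforcement learning framework, improving both search efficiency and accuracy, and the paper introduces IMEB, a new benchmark that evaluates search capability and efficiency jointly.

Comments Code & Data: https://github.com/DeepExperience/HyperEyes

Abstract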

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.

2605.06856 2026-05-12 cs.LG cs.CL

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

Ishani Mondal, Shweta Bhardwaj

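Affiliations * University of Maryland, College Park

AI summary The paper argues that although generative AI systems perform well on standard benchmarks, they often fail to deliver real utility in practice, a problem observed across 28 deployment cases in education, healthcare, software engineering, and law. It attributes the disconnect between evaluation results and real-world utility to three flaws in current evaluation practice: proxy displacement, temporal collapse, and distributional concealment. The paper proposes SCU-GenEval, an evaluation framework that measures an AI system's value through the effects of long-term interaction, grounded in human goals and context, and introduces several supporting instruments to make this evaluation paradigm practical.

Comments 20 pages

Abstract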

Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder, goal, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder's capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.

2605.06644 2026-05-12 cs.LG

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit

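Affiliations * China-ASEAN College of Marine Sciences

AI summary This study proposes a mechanism graph algorithm, based on the three-dimensional structure of the mature chromophore region, for predicting fluorescent protein quantum yield. The method converts protein structures into region-partitioned 3D residue graphs and uses channel-signal-region propagation to capture how local physical signals act on chromophore regions; the resulting representation contains 121 features, of which 52 non-identity features are used for regression. The method performs well across several benchmark settings, in particular outperforming existing models on remote homologs, and reveals band-specific mechanisms across different fluorescent proteins.

Comments Includes appendix; source code, processed feature tables and evaluation scripts are available from the first author upon reasonable request

Abstract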

Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com

2605.06366 2026-05-12 cs.LG

Layer Collapse in Diffusion Language Models

Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu

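Affiliations * Tübingen AI Center; Max Planck Institute for Intelligent Systems; ELLIS Institute Tübingen

AI summary This paper studies the layer-collapse phenomenon in diffusion language models (DLMs): a few early layers show highly similar activation patterns dominated by a single super-outlier, a structure that persists over a long token range. Although the outlier appears redundant, it is critical to the model's output, and removing it causes outputs to degrade into repetitive random token loops. The study also shows that redundancy in DLMs is distributed in the reverse of autoregressive models, concentrating in the earlier layers, and that layer collapse is driven by overtraining rather than undertraining, with important practical implications for model compression and deployment.

Comments 9 Pages, Preprint

Abstract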

Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers -- the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only 1.8% on GSM8K, whereas Llama-3.1-8B drops 64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama 8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.

2605.06042 2026-05-12 cs.RO

Accurate Trajectory Tracking with MPCC for Flapping-Wing MAVs

Charbel Toumieh, Jack Zeng, Niel Mistry, Dario Floreano

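Affiliations * Laboratory of Intelligent Systems, Ecole Polytechnique Federale de Lausanne (EPFL)

AI summary This paper studies high-precision trajectory tracking for flapping-wing micro aerial vehicles (MAVs). Because lift, airspeed, and turning are tightly coupled and control inputs are few, it proposes a control method based on Model Predictive Contouring Control (MPCC). The method tracks arc-length-parameterized trajectories and optimizes progress online, with no predefined timing profile, and uses a compact, continuously differentiable dynamical model to describe the coupled aerodynamics of the ornithopter. Experiments show centimeter-level deviation from the reference when tracking complex three-dimensional trajectories, a substantial improvement over prior methods.

Comments 7 pages, 6 figures

Abstract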

Flapping-wing micro aerial vehicles offer quieter and safer operation than rotary-wing drones, yet achieving precise autonomous control of bird-scale ornithopters remains challenging: lift, airspeed, and turning authority are tightly coupled and governed by only a few control inputs. Conventional cascaded controllers treat altitude, speed, and heading independently, producing persistent tracking errors during complex maneuvers, while time-parameterized trajectory tracking requires predefined speed profiles that existing methods cannot robustly produce for these coupled dynamics. We address both limitations simultaneously with a Model Predictive Contouring Control (MPCC) approach that tracks arc-length-parameterized trajectories while optimizing progress online, eliminating the need for predefined timing. However, MPCC requires a dynamical model that captures the coupled aerodynamics without exceeding the computational budget of real-time nonlinear optimization. Here, we propose a compact, continuously differentiable model that captures the dominant couplings of bird-scale ornithopters, enabling real-time predictive control. We validated the method with the XFly ornithopter flying along circular and three-dimensional racing trajectories and achieved a mean deviation from the reference trajectory between 6.5 and 9 cm at speeds up to 3 m/s, which represents an almost 10-fold improvement over prior ornithopter control methods.
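The core quantity in contouring control is the split of the tracking error into a lag part along the path and a contouring part perpendicular to it, evaluated at a path parameter that the optimizer is free to advance. The sketch below shows that generic decomposition for an arc-length-parameterized circle; it is not the authors' controller or dynamics model, and both function names are hypothetical.

```python
import math


def contouring_and_lag_error(position, ref_point, ref_tangent):
    """Split the tracking error into contouring and lag components."""
    err = [p - r for p, r in zip(position, ref_point)]
    norm = math.sqrt(sum(c * c for c in ref_tangent))
    tangent = [c / norm for c in ref_tangent]
    # Lag error: signed component of the error along the path tangent.
    lag = sum(e * t for e, t in zip(err, tangent))
    # Contouring error: length of what remains perpendicular to the tangent.
    contour_vec = [e - lag * t for e, t in zip(err, tangent)]
    contour = math.sqrt(sum(c * c for c in contour_vec))
    return contour, lag


def circle_reference(theta, radius=1.0):
    """Arc-length-parameterized horizontal circle: point and unit tangent."""
    angle = theta / radius
    point = [radius * math.cos(angle), radius * math.sin(angle), 0.0]
    tangent = [-math.sin(angle), math.cos(angle), 0.0]
    return point, tangent


if __name__ == "__main__":
    point, tangent = circle_reference(theta=0.0)
    print(contouring_and_lag_error([1.1, 0.2, 0.0], point, tangent))
```

An MPCC cost typically penalizes both errors while rewarding progress in `theta`, which is what removes the need for a predefined timing profile; the weights and the vehicle model are outside the scope of this sketch.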