arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20087 2026-06-19 cs.AI New submission

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

Kianoush Aqabakee, Leonardo Stella

Affiliations: Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic); Department of Mechanical Engineering, Amirkabir University of Technology (Tehran Polytechnic); School of Computer Science, University of Birmingham

AI summary: Proposes a continuous-action-space method that combines a multi-head attention mechanism with the Soft Actor-Critic algorithm for porosity prediction and parameter optimization in additive manufacturing, achieving faster convergence and higher reward.

Abstract

Additive manufacturing process optimization requires precise parameter control to minimize defects such as porosity. Traditional reinforcement learning (RL) approaches using discrete action spaces suffer from slow convergence and susceptibility to local optima, limiting their effectiveness for high-precision manufacturing tasks. This study addresses these limitations by employing a continuous action space combined with a novel architecture that integrates a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. The attention-based feature extractor enhances the agent's ability to capture subtle variations in low-dimensional input features, enabling more effective exploration-exploitation balance for navigating value spaces with local minima. We validate our approach on porosity prediction and process parameter optimization in laser powder bed fusion, demonstrating faster convergence and higher final reward values compared to standard RL methods including DQN, PPO, TD3, and vanilla SAC. The proposed methodology achieves a convergence value of 322.79 within 14 episodes, outperforming existing approaches while maintaining stability throughout training.

2606.20084 2026-06-19 cs.AI New submission

Residual-Space Evolutionary Optimization via Flow-based Generative Models

Zhuo Cao, Lena Krieger, Fernanda Nader, Xuan Zhao, Hanno Scharr, Ira Assent

Affiliations: LMU Munich, Munich Center for Machine Learning (MCML), Germany; Department of Computer Science, Aarhus University, Denmark

AI summary: Proposes a residual-space evolutionary optimization framework that combines flow-based generative editing with evolutionary algorithms, separating local exploitation from global exploration in residual space, for data editing under non-differentiable black-box objectives.

Comments: Accepted by ICML 2026 Workshop SPIGM, 5 pages, 3 figures

Abstract

Data editing with generative methods typically requires differentiable objectives and gradient-based search. However, these assumptions break down in flow-based settings, where edits are performed through forward and backward integration and often involve non-differentiable or black-box objectives. We introduce residual-space evolutionary optimization, a model-agnostic framework that addresses this gap by combining flow-based generative editing with evolutionary algorithms. Building on the observation that conditional flow matching (CFM) can disentangle condition-controlled factors from instance-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: self-pollination performs local exploitation through feature-preserving residual refinement, and cross-pollination promotes broader exploration by recombining residuals across heterogeneous samples. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration-exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real-world scientific domains.
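
The two search regimes named in the abstract can be pictured with a toy sketch. It assumes residuals are plain float vectors and that the objective is only available as a black-box `score` function. The names `self_pollinate`, `cross_pollinate`, and `evolve`, the Gaussian perturbation, and the uniform crossover are illustrative assumptions, not the operators from the paper, and the flow model that maps residuals back to data is omitted entirely.

```python
import random

def self_pollinate(residual, sigma, rng):
    """Local exploitation: small Gaussian perturbation of a single residual."""
    return [r + rng.gauss(0.0, sigma) for r in residual]

def cross_pollinate(res_a, res_b, rng):
    """Broader exploration: uniform crossover between two different residuals."""
    return [a if rng.random() < 0.5 else b for a, b in zip(res_a, res_b)]

def evolve(population, score, generations=20, sigma=0.1, seed=0):
    """Elitist loop that queries `score` only as a black box (no gradients)."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        children = [self_pollinate(p, sigma, rng) for p in population]
        for _ in range(size):
            a, b = rng.sample(population, 2)
            children.append(cross_pollinate(a, b, rng))
        # Keep the best `size` candidates among parents and children.
        population = sorted(population + children, key=score, reverse=True)[:size]
    return population

def toy_score(residual):
    # Hypothetical non-differentiable objective:
    # number of coordinates that lie within 0.25 of 1.0.
    return sum(1 for r in residual if abs(r - 1.0) < 0.25)

start = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.5], [2.0, 2.0, 1.0], [0.5, 1.5, 0.0]]
best = evolve(start, toy_score)[0]
```

Because parents compete with their children in the selection step, the best score never decreases from one generation to the next, which is the only property this sketch relies on.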

2606.20077 2026-06-19 cs.CV cs.AI New submission

The Hidden Evolution of Disguised Visual Context inside the VLM

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

Affiliations: Surrey Institute for People-Centred AI, University of Surrey; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey

AI summary: Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context injection versus layer-wise injection), revealing their internal evolution and its effect on performance.

Abstract

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. How these architectural choices affect visual information and its internal transformation within the LLM remains underexplored, and a controlled comparison is lacking. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution: visual tokens enter the LLM as disguised visual context (raw representations lacking linguistic structure) but are progressively reshaped depending on the integration paradigm, with each paradigm capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20076 2026-06-19 cs.CV cs.AI New submission

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Dong Hoon Lee, Seunghoon Hong

Affiliations: Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea; School of Computing, KAIST, Daejeon, South Korea

AI summary: To address the way a fixed compression ratio constrains the quality-compute trade-off of diffusion models, proposes a variable-length tokenizer based on learnable global merging that aligns representations across lengths by merging tokens, achieving a better gFID-compute trade-off on ImageNet 256×256 generation.

Abstract

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256×256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at [this https URL](https://github.com/movinghoon/lgm)
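
The distinction between data-dependent and data-independent merging can be illustrated with a small sketch. It assumes a merge plan is just a fixed list of index groups per target length, applied by averaging. The plans below are hand-written placeholders (`MERGE_PLANS` and `merge_tokens` are hypothetical names); in the described method the pattern is learned, and how it is learned is not shown here.

```python
def merge_tokens(tokens, groups):
    """Average the tokens in each group.

    `groups` is fixed ahead of time and never inspects the token values,
    so the same pattern is known before any tokens have been generated.
    """
    dim = len(tokens[0])
    merged = []
    for group in groups:
        merged.append([sum(tokens[i][d] for i in group) / len(group) for d in range(dim)])
    return merged

# Hypothetical merge plans for a 4-token sequence at three target lengths.
# A learned variant would train these assignments once and then freeze them.
MERGE_PLANS = {
    4: [[0], [1], [2], [3]],
    2: [[0, 1], [2, 3]],
    1: [[0, 1, 2, 3]],
}

tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
short = merge_tokens(tokens, MERGE_PLANS[2])
shortest = merge_tokens(tokens, MERGE_PLANS[1])
```

A similarity-based merger would instead pick `groups` by comparing token values, which is exactly the information that is unavailable while a sample is still being generated.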

2606.20075 2026-06-19 cs.LG cs.CL New submission

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

AI summary: Analyzes supervision failure in latent chain-of-thought from an information-theoretic perspective, proposes two dimensions (trajectory supervision and space supervision), and introduces the Unified Latent Probe (ULP) to quantify information fidelity, revealing an information-performance binding.

Abstract

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at [this repository](https://github.com/EIT-NLP/Supervision-in-Latent-CoT).

2606.20072 2026-06-19 cs.CL New submission

Source-Grounded Data Generation for Text-to-JSON Learning

Sunghee Ahn, Guijin Son, Youngjae Yu

Affiliations: Seoul National University

AI summary: Proposes STAGE, which uses spreadsheets as source data, generates reports and JSON schemas with LLMs, and validates ground-truth values, substantially improving training data quality for text-to-JSON tasks.

Comments: Preprint

Abstract

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
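
The abstract reports exact match and value accuracy without defining them. The sketch below shows one plausible reading, stated as an assumption: exact match compares the parsed prediction with the parsed reference, and value accuracy is the fraction of reference leaf values reproduced at the same path. The helper names are hypothetical and the benchmark's actual scoring rules may differ.

```python
import json

def flatten(obj, prefix=()):
    """Map every leaf value of a parsed JSON object to its path (a tuple of keys)."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    leaves = {}
    for key, value in items:
        leaves.update(flatten(value, prefix + (key,)))
    return leaves

def exact_match(pred_text, gold_text):
    """True when both strings parse to equal JSON values (key order is ignored)."""
    try:
        return json.loads(pred_text) == json.loads(gold_text)
    except json.JSONDecodeError:
        return False

def value_accuracy(pred_text, gold_text):
    """Fraction of reference leaves that the prediction reproduces at the same path."""
    gold = flatten(json.loads(gold_text))
    try:
        pred = flatten(json.loads(pred_text))
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns no credit in this sketch
    if not gold:
        return 1.0 if not pred else 0.0
    hits = sum(1 for path, value in gold.items() if path in pred and pred[path] == value)
    return hits / len(gold)

gold = '{"revenue": 120, "items": [{"name": "A", "qty": 2}]}'
pred = '{"items": [{"name": "A", "qty": 3}], "revenue": 120}'
```

Under this reading a prediction with one wrong field fails exact match but still earns partial value accuracy, which would explain why the two reported numbers differ.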

2606.20068 2026-06-19 cs.AI New submission

Process-Verified Reinforcement Learning for Theorem Proving via Lean

Minsu Kim, Se-Young Yun

Affiliations: KAIST AI

AI summary: Proposes using the Lean proof assistant to supply process-level verification signals, combined with a GRPO-style reinforcement learning objective, improving theorem-proving performance through tactic-level supervision.

Abstract

While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
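
The idea of turning per-tactic verification into dense credit can be sketched as follows. The sketch assumes the verifier has already produced one boolean per tactic. It does not call Lean, and the reward values, the zero credit after the first error, and the `alpha` blend are illustrative assumptions, not the paper's first-error propagation or first-token credit rules.

```python
def tactic_credits(statuses, ok_reward=1.0, fail_penalty=-1.0):
    """Per-tactic credit from a list of booleans (True = step checked as sound).

    Steps before the earliest failure get `ok_reward`, the earliest failing
    step gets `fail_penalty`, and later steps get 0.0 because nothing after
    the first error was verified.
    """
    credits = []
    failed = False
    for ok in statuses:
        if failed:
            credits.append(0.0)
        elif ok:
            credits.append(ok_reward)
        else:
            credits.append(fail_penalty)
            failed = True
    return credits

def blended_signal(statuses, alpha=0.5):
    """Mix the binary outcome reward with the per-tactic credits."""
    outcome = 1.0 if statuses and all(statuses) else 0.0
    return [alpha * outcome + (1.0 - alpha) * c for c in tactic_credits(statuses)]

partial_proof = [True, True, False, True]
credits = tactic_credits(partial_proof)
```

An outcome-only reward would give this failed attempt a flat zero, whereas the per-tactic credits still separate the two sound steps from the step that broke the proof.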

2606.20056 2026-06-19 cs.RO New submission

VFILC: Accurate Frequency Extrapolations in Imitation Learning via Sampling Frequency ILC

Nozomu Masuya, Toshiaki Tsuji, Sho Sakaino

Affiliations: Grad. School of Science Technology, University of Tsukuba, Tsukuba, Japan; Engineering, Saitama University, Saitama, Japan; Information Engineering, University of Tsukuba, Tsukuba, Japan

AI summary: Proposes VFILC, which combines variable-frequency imitation learning with feedforward-feedback iterative learning control, achieving accurate speed extrapolation on three tasks and reducing frequency error by up to 81%.

Comments: 8 pages, 17 figures. Accepted at IROS 2026

Abstract

Conventional neural network (NN)-based imitation learning methods for variable-speed motion either restricted their scope to interpolated speeds, or generated unpredictable motions when extrapolating beyond trained velocity ranges. Variable-frequency imitation learning (VFIL) enabled extrapolations of speeds by linking the NN model's sampling frequency to the motion frequency, whereas its open-loop configuration caused frequency errors, especially in the extrapolated high-frequency settings. This study proposes variable-frequency imitation learning with iterative learning control (VFILC) based on a combination of VFIL and iterative learning control (ILC) with both feedforward and feedback parts, the former taking advantage of VFIL and the latter adjusting the frequency errors. The experimental results showed that the proposed method successfully and accurately extrapolated motion speeds and reduced frequency errors in all three tasks, and that the feedback especially reduced the frequency errors by a remarkable 81% in the wiping task and 50% in the shaking task, both compared to simple feedforward VFIL, when extrapolating at double the average speed in the training data. The proposed method also improved accuracy by 27% compared with VFIL even at an interpolated frequency for a contact-rich mixing task affected by complex friction traits.
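
The role of the feedback part can be illustrated with a generic P-type iterative learning control update, in which each trial corrects the next command by a gain times the observed error. This is a textbook form applied to a scalar frequency command under a made-up plant. The names `run_ilc` and `lagging_plant`, the 20% lag, and the gain are assumptions for illustration, not the controller or robot dynamics from the paper.

```python
def run_ilc(target_hz, plant, feedforward_hz, gain=0.5, trials=30):
    """P-type iterative learning control on a scalar frequency command.

    Each trial executes the motion, measures the realized frequency, and
    corrects the next command by `gain` times the observed error.
    """
    command = feedforward_hz  # open-loop starting point (the feedforward part)
    errors = []
    for _ in range(trials):
        realized = plant(command)
        error = target_hz - realized
        errors.append(error)
        command += gain * error  # the feedback correction learned across trials
    return command, errors

def lagging_plant(f_cmd):
    # Hypothetical plant: the realized frequency is 20% below the command.
    return 0.8 * f_cmd

command, errors = run_ilc(target_hz=2.0, plant=lagging_plant, feedforward_hz=2.0)
```

With this plant the error shrinks by a constant factor of 1 - 0.5 * 0.8 = 0.6 per trial, so the open-loop error of the first trial decays geometrically while the command settles near 2.5 Hz.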

2606.20055 2026-06-19 cs.LG New submission

PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection

Youji Zhu, Hongbing Wang, Wenchao Liu, Xiaodong Liu, Xiangguang Xiong

Affiliations: School of Mathematical Sciences, Guizhou Normal University; School of Big Data and Computer Science, Guizhou Normal University

AI summary: Proposes PaAno+, a lightweight and efficient time-series anomaly detection model that uses multiscale feature extraction, cross-variable fusion attention, and a patch-window sorting pretext task, reaching state-of-the-art results on the TSB-AD benchmark.

Abstract

Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational overhead, while existing lightweight alternatives are constrained by insufficient feature extraction and inadequate modeling of dependencies across variables. To mitigate these drawbacks, this study develops a lightweight, efficient anomaly detection model, dubbed PaAno+, within the patch-oriented representation learning paradigm. In the encoder module, a multiscale feature-extraction backbone is constructed using convolutional kernels with differentiated receptive fields to capture hierarchical temporal characteristics; subsequent cross-scale adaptive attention aggregation, combined with residual connection optimization, further stabilizes feature representation learning. A cross-variable fusion attention module is embedded to explicitly characterize inter-variable correlations, empowering the model to identify anomalous patterns amid intricate operational conditions. Moreover, a novel pretext task based on temporal patch-window sorting is customized to uncover intrinsic structural properties of time series, and triplet loss is leveraged to optimize the patch embedding space for enhanced feature discrimination. Extensive experiments on the TSB-AD benchmark demonstrate that the proposed PaAno+ achieves state-of-the-art detection accuracy on both univariate and multivariate tasks, yielding significant performance gains across evaluation metrics, including VUS-PR, relative to the original PaAno. Leveraging a compact network design, the presented model achieves favorable computational efficiency, enabling deployment on resource-limited terminals for real-time anomaly inference.
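
The triplet loss mentioned in the abstract has a standard form, sketched here with Euclidean distance on plain float vectors. The margin value and the example embeddings are arbitrary, and how the model selects anchor, positive, and negative patches is not described in the abstract and is not modeled.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge form: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical patch embeddings.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate that ordering contribute to training.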

2606.20045 2026-06-19 cs.CV cs.AI New submission

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

Affiliations: School of Information Science and Engineering, Shandong University; Faculty of Engineering and Information Technology, University of Technology Sydney; School of Computer Science and Technology, Shandong University; School of Artificial Intelligence, Shandong University; School of Computer Science and Artificial Intelligence, Shandong Normal University; Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University

AI summary: To address the insufficient evaluation of precise reaching once the target is visible in UAV vision-language navigation, proposes the UAV-VLN-FOV task and the 3DG-VLN framework, which uses dynamic 3D direction cues to strengthen fine-grained visual grounding and spatial alignment, improving success rate on the benchmark and in real-world experiments.

Comments: 12 pages, 7 figures

Abstract

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.

2606.20044 2026-06-19 cs.CV New submission

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University; School of Informatics, Xiamen University; National University of Singapore

AI summary: Proposes FUSE, a frequency-domain framework that addresses low-frequency bias in multi-modal re-identification through a two-stage process of spectral disentanglement and energy alignment, improving mAP by 9.1% on three datasets.

Comments: Accepted in ICML 2026

Abstract

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
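
Partitioning a signal into low, mid, and high frequency bands can be shown on a one-dimensional toy example. The sketch uses a naive discrete Fourier transform with fixed cutoffs. The described module is adaptive and operates on learned features, so the cutoffs, the 1D signal, and the helper names are assumptions made only to show the decomposition and the per-band energies that an alignment step might compare.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(signal, low_cut, high_cut):
    """Split a real 1D signal into low, mid, and high bands that sum back to it."""
    n = len(signal)
    spectrum = dft(signal)
    bands = {"low": [0j] * n, "mid": [0j] * n, "high": [0j] * n}
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        name = "low" if freq <= low_cut else "mid" if freq <= high_cut else "high"
        bands[name][k] = coeff
    return {name: idft(coeffs) for name, coeffs in bands.items()}

def energy(samples):
    """Sum of squared samples."""
    return sum(v * v for v in samples)

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
bands = split_bands(signal, low_cut=1, high_cut=2)
band_energy = {name: energy(values) for name, values in bands.items()}
```

Because the three bands occupy disjoint frequency bins, they add back to the original signal and their energies add up to the signal's total energy.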

2606.20037 2026-06-19 cs.LG New submission

Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET

Loukas Ilias, Anthi-Maria Vozinaki, Christos Ntanos, Dimitris Askounis

Affiliations: DSS Lab, School of ECE, NTUA

AI summary: Proposes a multimodal model for Alzheimer's disease diagnosis that combines 3D convolutional feature extractors with three fusion strategies (concatenation, Gated Multimodal Unit, gated self-attention) and a sparsely gated Mixture-of-Experts classifier, validating input-adaptive modeling on three binary classification tasks.

Comments: 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Abstract

Alzheimer's disease (AD) is an irreversible neurodegenerative disorder and a leading cause of death worldwide. Early diagnosis is especially important at the Mild Cognitive Impairment (MCI) stage, where timely intervention can help slow progression to AD. Neuroimaging data, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) scans, can help detect the disease early by revealing the structural and functional brain changes associated with it. Yet, many multimodal models still fuse MRI and PET with static concatenation and apply identical computation to all subjects, which limits robustness to patient/site heterogeneity and can waste computation. To address these limitations, we present the first study of combining 3D convolutional feature extractors with three fusion strategies - concatenation, Gated Multimodal Unit (GMU), and gated self-attention - and a sparsely gated Mixture-of-Experts (MoE) classifier that performs input-adaptive routing, activating only the most informative experts per case. Finally, we utilize Grad-CAM to visualize disease-related regions, ensuring model interpretability. Experiments are performed across three binary classification tasks (NC vs. MCI, MCI vs. AD, and NC vs. AD). Results show that GMU achieves accuracies of 80.46% (NC vs. MCI) and 95.47% (NC vs. AD), while gated self-attention attains 82.08% on MCI vs. AD. Ablations show that removing the MoE consistently degrades accuracy across all tasks. These findings underscore the value of input-adaptive, multimodal modeling for AD diagnosis by leveraging the complementary nature of MRI and PET.
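
Of the three fusion strategies, the Gated Multimodal Unit has a compact closed form that is easy to sketch. The code below uses the commonly cited two-modality formulation, in which a sigmoid gate computed from both inputs mixes the two tanh-transformed features. The weight matrices and feature vectors are tiny placeholders, and the paper's feature dimensions and trained weights are not known from the abstract.

```python
import math

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def gmu(x_mri, x_pet, w_mri, w_pet, w_gate):
    """Gated Multimodal Unit for two modalities.

    h_mri = tanh(W_mri x_mri), h_pet = tanh(W_pet x_pet),
    z = sigmoid(W_gate [x_mri; x_pet]), output = z * h_mri + (1 - z) * h_pet.
    """
    h_mri = [math.tanh(v) for v in matvec(w_mri, x_mri)]
    h_pet = [math.tanh(v) for v in matvec(w_pet, x_pet)]
    gate = [1.0 / (1.0 + math.exp(-v)) for v in matvec(w_gate, x_mri + x_pet)]
    fused = [z * a + (1.0 - z) * b for z, a, b in zip(gate, h_mri, h_pet)]
    return fused, gate

# Placeholder inputs and weights: identity projections and an all-zero gate.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]
fused, gate = gmu([1.0, 0.0], [0.0, 1.0], identity, identity, zero_gate)
```

With an all-zero gate matrix every gate value is 0.5 and the unit reduces to a plain average of the two modalities; trained gate weights would instead let the mix vary per input, unlike static concatenation.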

2606.20035 2026-06-19 cs.CV cs.LG New submission

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

Affiliations: Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz; Technical University of Munich

AI summary: Proposes PU-UNet, which uses stable product-unit residual blocks at low-resolution stages for explicit multiplicative feature interactions, improving Dice and IoU on three medical image segmentation datasets and lowering the false-positive rate.

Comments: Accepted to ICANN 2026

Abstract

Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic-exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
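
A product unit computes a weighted product of its inputs through logarithms and exponentials, which is where the instability comes from: the logarithm is undefined for non-positive inputs and unbounded near zero. The sketch below shows one way to combine a smooth positivity map with log-domain clipping. Using softplus as the positivity map, and the `eps` and `log_clip` values, are assumptions, and the paper's exact formulation may differ.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def stable_product_unit(inputs, exponents, eps=1e-6, log_clip=10.0):
    """Compute prod_i p(x_i) ** w_i as exp(sum_i w_i * clip(log p(x_i)))."""
    total = 0.0
    for x, w in zip(inputs, exponents):
        positive = softplus(x) + eps  # smooth map into the positive reals
        log_value = math.log(positive)
        log_value = max(-log_clip, min(log_clip, log_value))  # log-domain clipping
        total += w * log_value
    return math.exp(total)

ordinary = stable_product_unit([2.0, 3.0], [1.0, 1.0])
extreme = stable_product_unit([-1000.0], [1.0])
```

For moderate inputs the clip is inactive and the unit is an ordinary product of the mapped values, while a very negative input is held at exp(-log_clip) instead of collapsing to zero or raising a domain error.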

2606.20032 2026-06-19 cs.CV New submission

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) College of Surveying and Geo-Informatics, Tongji University(同济大学测绘与地理信息学院)

AI总结 提出一种无需训练的可靠性感知开放词汇变化检测框架,通过语义变化推理和边界感知精炼策略,解决实例级比较忽略细粒度变化和像素级比较不可靠的问题,在多个数据集上F1提升2.13%-9.75%。

Abstract

Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD

2606.20031 2026-06-19 cs.RO cs.AI New submission

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu

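Affiliations: The Hong Kong University of Science and Technology (Guangzhou); JD Explore Academy

AI summary: Proposes the SDQN-RMFS framework, which uses ANN-to-SNN conversion and hard-label knowledge distillation to run ultra-low-power pathfinding on a neuromorphic chip, with up to 11,281× lower energy use than a GPU baseline and latency cut nearly in half.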

Abstract

Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281$\times$ energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.

2606.20027 2026-06-19 cs.CV New submission

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Luca Zedda, Davide Antonio Mura, Cecilia Di Ruberto, Maurizio Atzori, Muhammed Furkan Dasdelen, Carsten Marr, Andrea Loddo

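Affiliations: Department of Mathematics and Computer Science, University of Cagliari; Institute of AI for Health, Helmholtz Munich

AI summary: Proposes QG-MIL, a gated Transformer aggregator that counters attention concentration with RMSNorm pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU feed-forward modules, improving macro F1 by an average of 6.1 points across six benchmarks.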

Abstract

Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: https://github.com/unica-visual-intelligence-lab/QG-MIL
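Two of the four components named in the abstract, RMSNorm and a SwiGLU-style feed-forward module, have widely used standard definitions. The sketch below shows those definitions in NumPy. It is illustrative only and is not taken from the QG-MIL repository: the shapes, the random weights, and the names `rms_norm` and `swiglu_ffn` are assumptions, and the per-head QK normalization and output gating are not shown.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Rescale each row to (approximately) unit root-mean-square,
    # then apply a per-feature gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: a SiLU-activated gate multiplies a linear "up" projection
    # before the "down" projection back to the model width.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
n_instances, d_model, d_hidden = 5, 8, 16   # hypothetical bag size and widths

# Hypothetical instance embeddings with a large scale, to show the effect of the norm.
bag = rng.normal(size=(n_instances, d_model)) * 10.0
normed = rms_norm(bag, gain=np.ones(d_model))
out = swiglu_ffn(
    normed,
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_model, d_hidden)),
    rng.normal(size=(d_hidden, d_model)),
)
```

Pre-normalization means `rms_norm` is applied to the input of each sub-block rather than to its output, which is why the sketch normalizes before the feed-forward call.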

2606.20015 2026-06-19 cs.LG New submission

Adaptive Distance-Aware Trunk Deep Operator Learning for Long-Span Roadway Bridges

Bilal Ahmed, Diab W. Abueidda, Waleed El-Sekelly, Tarek Abdoun, Mostafa E. Mobasher

Affiliations: Urban Engineering Department, New York University Abu Dhabi, United Arab Emirates; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, United States of America; Department of Structural Engineering, Mansoura University, Mansoura, Egypt

AI summary: Proposes an adaptive-trunk DeepONet framework that combines KNN-built load-dependent learning domains, distance-aware features, and stiffness-informed Schur complement reconstruction to predict localized responses of long-span bridges quickly and accurately, with relative errors below 5% and a roughly 60× speedup.

Comments 39 pages, 26 figures

Abstract

Long-span roadway bridges exhibit highly localized structural responses under vehicular loading, making repeated FE analysis computationally expensive for applications such as influence surface generation and structural digital twins. Existing SciML approaches struggle to accurately capture these localized responses. To address this challenge, this study proposes an adaptive-trunk DeepONet for localized structural response prediction in large-scale bridge systems. The framework dynamically constructs a load-dependent learning domain using a KNN strategy, allowing the network to focus on structural influence zones. The trunk network is further enhanced using distance-aware features that encode the geometric relationship between the load and structural nodes. A physics-based full-field reconstruction is incorporated through a stiffness-informed Schur complement formulation, enabling predictions at adaptive nodes to be extended to the entire structural domain. To enable scalable training, response data are generated using a reduced-order equivalent shell model that preserves the dominant global behavior while significantly reducing computational cost. The proposed framework is validated on both a benchmark bridge model and the real-world Mussafah Bridge. Results show that the method achieves FEM-level accuracy with relative errors below 5%, while reducing the total response evaluation time (including full-field reconstruction) by approximately 60x; excluding the post-processing reconstruction step, the AD-DeepONet inference is up to four orders of magnitude faster than FEM. In addition, the framework enables rapid generation of full-field responses, influence lines, and influence surfaces under arbitrary vehicular loading configurations, demonstrating strong potential for large-scale bridge analysis and digital twin applications.
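The full-field reconstruction step can be illustrated with textbook static condensation. If the stiffness system K u = f is partitioned into adaptive nodes a (where displacements are predicted) and remaining nodes r, the second block row gives u_r = K_rr^{-1} (f_r - K_ra u_a). The sketch below is an assumption-laden toy, not the paper's implementation: it uses a chain of unit springs instead of a bridge model, a hand-picked node set in place of the KNN selection, and the exact solution in place of network predictions.

```python
import numpy as np

n = 8
# Stiffness matrix of a chain of unit springs fixed at both ends (tridiagonal, SPD).
K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.zeros(n)
f[3] = 1.0  # single point load

# Hypothetical load-dependent node set; it contains the loaded node.
adaptive = np.array([2, 3, 4])
rest = np.setdiff1d(np.arange(n), adaptive)

u_full = np.linalg.solve(K, f)   # reference full solve
u_a = u_full[adaptive]           # stands in for the network's predictions

K_aa = K[np.ix_(adaptive, adaptive)]
K_ar = K[np.ix_(adaptive, rest)]
K_ra = K[np.ix_(rest, adaptive)]
K_rr = K[np.ix_(rest, rest)]

# Recover the remaining nodes from the second block row of K u = f.
u_r = np.linalg.solve(K_rr, f[rest] - K_ra @ u_a)

u_rec = np.empty(n)
u_rec[adaptive] = u_a
u_rec[rest] = u_r

# Schur complement of K_rr: the condensed stiffness seen by the adaptive nodes.
S = K_aa - K_ar @ np.linalg.solve(K_rr, K_ra)
```

Because the remaining nodes carry no load here, the condensed system S u_a = f_a holds for the adaptive nodes alone, and the reconstruction reproduces the full solve whenever u_a is exact.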

2606.20010 2026-06-19 cs.LG New submission

Self-Adaptive Scale Handling for Forecasting Time Series with Scale Heterogeneity

Xu Zhang, Zhengang Huang, Yunzhi Wu, Xun Lu, Erpeng Qi, Yunkai Chen, Zhongya Xue, Peng Wang, Wei Wang

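Affiliations: College of Computer Science and Artificial Intelligence, Fudan University; Ant Group

AI summary: Proposes a self-adaptive scale-handling module that learns adaptive scale factors to preserve semantic discriminability and reduce inverse-scaling errors, improving mainstream forecasting models on fund sales datasets.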

Comments This is the full version of the paper accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). The code and dataset are available at https://github.com/Meteor-Stars/ASTSF

Abstract

Current time series forecasting (TSF) research predominantly focuses on scale-homogeneous data, where different time series share similar numerical magnitude ranges. However, in real-world industrial scenarios such as financial product sales, different time series often differ by orders of magnitude (scale heterogeneity). Since these series share similar temporal patterns, joint modeling is desirable for better data utilization, yet existing scaling methods either compress low-scale signals (global normalization) or destroy semantic discriminability and amplify inverse-scaling errors (window-based scaling). This paper proposes a self-Adaptive Scale-handling (AS) module that learns adaptive scale factors tailored to each input, preserving semantic discriminability while reducing inverse-scaling errors. AS consists of Scale Calibrating (SC), which calibrates prior mean scaling factors through neural networks, and Scaling Selection (SS), which decides whether to apply calibration or retain the original factor, avoiding over-calibration. Experiments on real-world fund sales datasets from Ant Fortune and Alipay show that AS seamlessly integrates into popular TSF models and consistently improves their performance. The code and dataset are available at the link https://github.com/Meteor-Stars/ASTSF.
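The inverse-scaling problem mentioned in the abstract can be seen in a few lines: an identical error in scaled space is multiplied by the scale factor when mapped back to original units. The sketch below uses plain mean scaling as the prior factor. The `calibrate` and `select` callables are hypothetical stand-ins for the learned Scale Calibrating and Scaling Selection components and are stubbed out here, so this shows the interface and the error amplification, not the paper's module.

```python
import numpy as np

def mean_scale(window, eps=1e-8):
    # Prior scale factor: mean absolute value of the input window.
    return float(np.mean(np.abs(window)) + eps)

def adaptive_scale(window, calibrate, select):
    # calibrate(window) -> multiplicative correction to the prior factor (hypothetical)
    # select(window)    -> True to apply the correction, False to keep the prior (hypothetical)
    prior = mean_scale(window)
    return prior * calibrate(window) if select(window) else prior

small = np.array([90.0, 110.0, 100.0, 105.0])
large = small * 1e4   # same temporal pattern, four orders of magnitude larger

errors = []
for series in (small, large):
    s = adaptive_scale(series, calibrate=lambda w: 1.0, select=lambda w: False)
    scaled = series / s
    forecast_scaled = scaled[-1] + 0.01          # pretend forecast, off by 0.01 in scaled space
    errors.append(abs(forecast_scaled * s - series[-1]))  # error after inverse scaling
```

Both series look identical after scaling, yet the error in original units differs by the ratio of their scale factors, which is the sensitivity a learned factor would need to manage.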

2606.20008 2026-06-19 cs.LG New submission

VIMPO: Value-Implicit Policy Optimization for LLMs

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

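Affiliations: UC Berkeley; Yale University

AI summary: Proposes VIMPO, which derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning, enabling fine-grained credit assignment without training a critic and outperforming GRPO on mathematical reasoning benchmarks.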

Abstract

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially larger gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.
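For orientation, the standard optimality conditions of KL-regularized RL lead to a value recurrence in policy-reference log-ratios of the kind the abstract describes. The derivation below is the textbook form under assumed notation (temperature $\beta$, per-step reward $r_t$, deterministic token transitions, terminal state $s_T$); the exact loss used by VIMPO may differ.

```latex
\begin{align}
\pi^*(a \mid s) &= \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^*(s,a) - V^*(s)\big)\Big), \\
Q^*(s_t, a_t) &= r_t + V^*(s_{t+1}), \qquad V^*(s_T) = 0, \\
\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  &= r_t + V^*(s_{t+1}) - V^*(s_t).
\end{align}
```

Summing the last line over $t = 0, \dots, T-1$ telescopes to $\beta \sum_t \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} = \sum_t r_t - V^*(s_0)$, so the terminal condition $V^*(s_T) = 0$ anchors every intermediate value to the outcome reward without a separately trained critic.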

2606.20005 2026-06-19 cs.LG cs.AI New submission

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Guangda Liu, Yiquan Wang, Chengwei Li, Wenhao Chen, Jing Lin, Yiwu Yao, Danning Ke, Wenchao Ding, Jieru Zhao

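Affiliations: Shanghai Jiao Tong University; Huawei; Fudan University

AI summary: Proposes StreamKL, the first fused GPU primitive for attention KL divergence, which uses an online formulation and tile-by-tile recomputation to cut the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, with speedups of up to 43× in the forward pass and 14× in the backward pass.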

Abstract

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
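One way to see why a one-pass formulation is possible: for logits a and b, KL(softmax(a) || softmax(b)) = Σ p_i (a_i - b_i) - lse(a) + lse(b), and each of the three terms can be accumulated tile by tile with running maxima, as in online softmax. The NumPy sketch below handles a single query row on the CPU. It is a hedged illustration of this identity, not the StreamKL kernel; the function names and the tile size are assumptions.

```python
import numpy as np

def streaming_kl(a, b, tile=4):
    """KL(softmax(a) || softmax(b)) for one query row, reading logits tile by tile."""
    m_a = m_b = -np.inf   # running maxima of each logit stream
    s_a = s_b = 0.0       # running sums of exp(logit - running max)
    w = 0.0               # running sum of exp(a - m_a) * (a - b)
    for start in range(0, len(a), tile):
        ta, tb = a[start:start + tile], b[start:start + tile]
        new_m_a = max(m_a, ta.max())
        new_m_b = max(m_b, tb.max())
        # Rescale the old accumulators to the new maxima before adding the tile.
        scale_a = np.exp(m_a - new_m_a)
        scale_b = np.exp(m_b - new_m_b)
        e_a = np.exp(ta - new_m_a)
        s_a = s_a * scale_a + e_a.sum()
        w = w * scale_a + (e_a * (ta - tb)).sum()
        s_b = s_b * scale_b + np.exp(tb - new_m_b).sum()
        m_a, m_b = new_m_a, new_m_b
    # sum_i p_i (a_i - b_i) - logsumexp(a) + logsumexp(b)
    return w / s_a - (m_a + np.log(s_a)) + (m_b + np.log(s_b))

def reference_kl(a, b):
    # Direct computation that materializes both distributions.
    p = np.exp(a - a.max()); p /= p.sum()
    q = np.exp(b - b.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
a = rng.normal(size=10) * 3.0   # hypothetical teacher logits for one query
b = rng.normal(size=10) * 3.0   # hypothetical student logits for the same query
```

The streaming version keeps five scalars per query regardless of the number of keys, which is the property that removes the need to store a full attention row.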

2606.20002 2026-06-19 cs.LG cs.AI cs.CL New submission

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

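Affiliations: Alibaba Group

AI summary: Proposes the Connect the Dots framework, which uses end-to-end reinforcement learning to train LLMs to self-update their context over long task sequences and generalize to new domains; experiments support cross-domain generalization.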

Comments Work in progress; we will continuously update the codebase and arXiv version

Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization of the elicited meta-capability (within the training domains, across different domains, and from CoD to Ralph-loop settings). Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG New submission

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

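Affiliations: InfoBodied AI Lab, The University of Hong Kong; HKU Musketeers Foundation Institute of Data Science

AI summary: Proposes Tri-Info, which uses information-theoretic signals capturing action diversity, temporal consistency, and state coupling to detect failures zero-shot across architectures, environments, and the sim-to-real gap, reaching 83% accuracy on real-world tasks.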

Abstract

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.

2606.19996 2026-06-19 cs.SD cs.CL New submission

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China; Shanghai Key Laboratory of Perception and Control in Industrial Network Systems; Department of Computer Science and Engineering, University of Bologna; Department of Mathematical, Physical and Computer Sciences, University of Parma

AI summary: Proposes a segment-level representation learning framework combining an autoencoder with contrastive learning that achieves stable binary and three-class cognitive impairment detection on four Mandarin datasets, with the largest gains in the clinically difficult three-class setting.

Comments 15 pages, 7 figures, 5 tables

Abstract

Background and Objective: Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. Methods: We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. Results: Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. Conclusions: The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

2606.19990 2026-06-19 cs.AI New submission

Reward as An Agent for Embodied World Models

Pu Li, Zhigang Lin, Qiang Wu, Yongxuan Lv, Fei Wang, Shan You

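Affiliations: ACE Robotics

AI summary: Proposes an agentic reward framework (Reward as an Agent) and dynamic-aware rollout diversification, using robust verification to support broader exploration, mitigate reward hacking, and improve world model performance.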

Abstract

While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To test this claim, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.

2606.19985 2026-06-19 cs.CV New submission

Vision-Reasoning-Guided Occlusion Removal from Light Fields

Mohamed Youssef, Oliver Bimber

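Affiliations: Johannes Kepler University

AI summary: Proposes a framework combining light field integration with a vision-language model, using multi-view fusion and semantic priors to recover occluded scenes, with state-of-the-art results on synthetic and real data.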

Abstract

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

2606.19984 2026-06-19 cs.LG New submission

Kolmogorov-Arnold Reservoir Computing

Juntian Huang, Jurgen Kurths, Ying Tang

Affiliations: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China; Potsdam Institute for Climate Impact Research; Department of Physics, Humboldt University Berlin; Research Institute of Intelligent Complex Systems, Fudan University; School of Physics, University of Electronic Science and Technology of China; Key Laboratory of Quantum Physics and Photonic Quantum Information, Ministry of Education, University of Electronic Science and Technology of China; Non-classical Information Science Basic Discipline Research Center of Sichuan Province, University of Electronic Science and Technology of China

AI summary: Proposes Kolmogorov-Arnold Reservoir Computing (KARC), which replaces the reservoir with explicit basis-function expansions, combining the expressive capacity of KANs with the closed-form training of reservoir computing and outperforming existing methods on benchmarks including partial differential equations.

Abstract

Reservoir computing offers a lightweight framework for forecasting dynamical systems but may struggle to capture long-range dependencies due to limited representational capacity. Conventional reservoir computing recurrently uses trainable reservoirs with hyperparameter sensitivity, while the next-generation reservoir computing removes recurrence at the cost of rapidly growing feature dimensions. Here, we develop Kolmogorov-Arnold Reservoir Computing (KARC), which replaces reservoirs with explicit basis-function expansions inspired by the Kolmogorov-Arnold representation theorem. We rigorously show that KARC is a lightweight design of Kolmogorov-Arnold networks (KANs), preserving the potential expressive capacity of KANs while admitting efficient closed-form training of reservoir computing. At comparable cost, KARC outperforms existing reservoir computing methods on challenging benchmarks including partial differential equations. It can also be integrated with generative diffusion models for text-to-image generation. This work thus establishes a principled bridge between reservoir computing and KANs, enabling efficient and high-fidelity dynamical system forecasting.
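The general recipe the abstract points to, an explicit univariate basis expansion followed by a closed-form linear readout, can be sketched in a few lines. The example below fits a one-step predictor for the logistic map using a Chebyshev basis and ridge regression. The choice of basis, the degree, the ridge strength, and the toy system are all assumptions made for illustration; this is not the KARC architecture from the paper.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

# Generate a chaotic scalar series from the logistic map (toy dynamical system).
x = np.empty(500)
x[0] = 0.2
for t in range(499):
    x[t + 1] = 3.9 * x[t] * (1.0 - x[t])

# Explicit basis expansion: Chebyshev polynomials T_0..T_3 of the state,
# after mapping the state from [0, 1] to [-1, 1].
degree = 3
features = chebvander(2.0 * x[:-1] - 1.0, degree)   # shape (499, degree + 1)
targets = x[1:]

# Closed-form ridge-regression readout: W = (F^T F + lambda I)^{-1} F^T y.
ridge = 1e-8
W = np.linalg.solve(
    features.T @ features + ridge * np.eye(degree + 1),
    features.T @ targets,
)
pred = features @ W
max_abs_error = float(np.max(np.abs(pred - targets)))
```

Training reduces to one linear solve, which is the property shared with reservoir computing; only the readout weights `W` are learned.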

2606.19980 2026-06-19 cs.AI New submission

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang, Cunxi Dai, Zi Wang, Jimmy Wu, Guanzhi Wang, S. Shankar Sastry, Ken Goldberg, Linxi "Jim" Fan, Yuke Zhu, Guanya Shi

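Affiliations: NVIDIA; CMU; UC Berkeley

AI summary: Proposes ENPIRE, a framework in which a closed feedback loop of environment reset, policy execution, outcome verification, and iterative refinement lets coding agents autonomously improve robot manipulation policies, reaching a 99% success rate on dexterous manipulation tasks.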

English abstract

Achieving dexterous robotic manipulation in the real world relies heavily on human supervision and algorithm engineering, which has become a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction for automating robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.

2606.19970 2026-06-19 cs.CV New submission

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

CrossFlow: 跨潜在空间与像素空间的单步生成

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

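Affiliations * Institute for Artificial Intelligence, Peking University; Tencent; Fudan University

AI summary Proposes CrossFlow, a cross-space flow model that maps noisy latent inputs directly to pixel images, uses a velocity-free one-step objective for latent-to-pixel generation, can replace the decoder in latent diffusion, and reaches 1.62 FID on ImageNet-1k.

Comments Preprint, Under Review

Details
AI Chinese abstract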

大多数扩散和流匹配生成器在相同的表示空间中定义先验、概率路径和预测目标。潜在扩散通过将该路径移动到自编码器潜在空间来提高效率,但最终样本仍由单独训练的解码器生成。这种分离造成了不匹配:生成器针对潜在空间预测进行优化,而最终质量取决于解码器如何处理可能与干净编码器输出不同的生成潜在变量。我们引入了CrossFlow,一种跨空间流公式,将噪声潜在输入直接映射到像素空间图像。关键技术步骤是一个无速度的单步目标:潜在轨迹定义了训练路径,但监督预测是图像而非潜在位移。这使得一个模型既可以作为单步潜在到像素生成器,也可以作为潜在扩散管道的解码器替代品。在类别条件ImageNet-1k $256\times256$上,CrossFlow-XL通过一次函数评估达到了1.62 FID。消融实验表明,潜在编码器以及像素空间感知和对抗损失对保真度很重要。这些结果表明,跨空间流目标可以结合潜在表示的效率与直接像素空间监督,而无需在推理时使用单独的解码器。

English abstract

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.

2606.19966 2026-06-19 cs.CV cs.LG New submission

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

语义锚定证据融合用于域鲁棒的全切片生存分析

Yucheng Xing, Ling Huang, Pei Liu, Jingying Ma, Jiaqing Xu, Kai He, Mengling Feng

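Affiliations * National University of Singapore; Imperial College London; Hunan University

AI summary Proposes the SAEFS framework, which extracts semantic anchors via visual question answering and combines dual-stream evidence extraction with Dirichlet-based subjective logic to model uncertainty, enabling zero-shot cross-domain survival analysis and improving the average C-index by 10.2%.

Details
AI Chinese abstract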

全切片图像(WSIs)广泛用于计算癌症预后。然而,现有方法主要关注域内性能,难以泛化到不同临床中心。这一局限性源于它们依赖像素级表示,极易受到染色协议和扫描硬件导致的域特定伪影影响。我们假设高级病理语义(如肿瘤分级和微环境结构)提供了域不变的语义表示,反映了人类病理学家的鲁棒诊断逻辑。因此,我们提出了语义锚定证据融合生存(SAEFS)框架,其中SAEFS通过视觉问答(VQA)从WSIs中推导语义锚点,采用双流WSI证据提取架构,使用基于狄利克雷的主观逻辑建模不确定性,并通过谨慎合取规则融合语义和视觉证据,以避免来自相关源的过度自信融合。仅在单一源域上训练并在四个未见域上进行零样本评估,SAEFS在预测准确性和可靠性上均一致优于最先进模型,平均C-index提升10.2%。定量分析进一步表明,VQA导出的语义特征比像素级特征表现出显著更低的跨中心差异,突显了其在跨中心临床应用中的鲁棒性。

English abstract

Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, which derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models in both prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
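To make the "Dirichlet-based Subjective Logic" step in the abstract above more concrete, here is a minimal sketch of the standard mapping from non-negative per-class evidence to belief masses and an uncertainty mass. It is a generic illustration, not the SAEFS implementation: the function name `dirichlet_opinion`, the three-class example values, and the reading of classes as something like discretized risk bins are all assumptions, and the cautious conjunction fusion rule is deliberately not shown.

```python
# Sketch: evidence -> subjective-logic opinion via a Dirichlet distribution.
# Generic formulation; names and example numbers are hypothetical.

def dirichlet_opinion(evidence):
    """Return (belief masses, uncertainty mass, expected class probabilities).

    With K classes and evidence e_k >= 0:
        alpha_k = e_k + 1,  S = sum_k alpha_k,
        b_k = e_k / S,      u = K / S,
    so that sum_k b_k + u = 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    K = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # Dirichlet concentration parameters
    S = sum(alpha)                        # Dirichlet strength
    belief = [e / S for e in evidence]    # per-class belief mass
    uncertainty = K / S                   # mass left unassigned to any class
    expected = [a / S for a in alpha]     # mean of the Dirichlet
    return belief, uncertainty, expected

# Strong evidence for class 0 yields low uncertainty ...
confident = dirichlet_opinion([18.0, 1.0, 1.0])
# ... while weak evidence everywhere yields high uncertainty.
vague = dirichlet_opinion([0.2, 0.1, 0.1])
```

In this formulation the uncertainty mass shrinks only as total evidence grows, which lets a downstream fusion rule discount a stream that has seen little support instead of trusting its softmax-style probabilities.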

2606.19965 2026-06-19 cs.CV cs.AI New submission

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

ROSE:多模态模型中感知到行动差距的基准测试

Yihao Wang, Zijian He, Jie Ren, Keze Wang

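Affiliations * Sun Yat-sen University; Shaanxi Normal University

AI summary Proposes the ROSE benchmark, which holds the visual scene fixed while varying region constraints and symbolic outputs to test whether multimodal large models can turn the same visual evidence into the action each context requires; model performance drops by up to 44.5 percentage points, revealing a perception-to-action bottleneck.

Comments 29 pages, 11 figures

Details
AI Chinese abstract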

多模态大语言模型(MLLMs)越来越被期望基于视觉信息采取行动,然而同一场景在不同任务上下文中可能需要不同的行动。模型能否可靠地将相同的视觉证据转化为当前上下文所需的行动?为了回答这个问题,我们引入了ROSE(Reference-conditioned Oddity and Symbolic Execution),一个受控基准,它在保持视觉场景固定的同时变化区域约束和所需的符号输出。通过耦合的计数和坐标行动任务,ROSE测试模型是否能够推断出隐含的多数参考,并在变化的上下文中基于由此产生的细粒度视觉证据采取行动。在九个最近的MLLMs中,从计数导向任务到区域条件行动的性能下降高达44.5个百分点,而人类表现达到98.8%。这种差距在成对的场景和区域中持续存在,即使同一模型在这些场景和区域上返回正确的计数,而全局点击和匹配的局部控制表明坐标定位仅解释了部分损失,揭示了在将共享视觉证据转化为上下文特定行动时存在一个独特的、模型相关的瓶颈。

English abstract

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.