arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.10448 2026-06-10 cs.LG cs.AI New submission

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

Zeyu Liu, Xuanzhi Feng, Sing Kwong Lai, Yuanchen Gao, Xiaoyi Pang, Hualei Zhang, Jingcai Guo, Jie Zhang, Song Guo

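Affiliations: The Hong Kong University of Science and Technology

AI summary: To address the instability of SAC in low-SNR financial markets, the paper proposes FPQC-SAC, a variant that uses a parameterized quantum circuit to constrain feature propagation at the representation level. This reduces the impact of extreme fluctuations and yields a 66.89% relative gain in cumulative return on real-world portfolio management tasks.

Comments: Preprint. Code available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main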

Abstract

The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the "Financial Entropy Trap". In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns: it achieves a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforms the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main.
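To illustrate why a PQC front end is bounded, here is a minimal classical simulation of a two-qubit circuit. It is only a sketch under stated assumptions: the circuit layout (one RY rotation per feature followed by a fixed CNOT), the function names, and the scaling weights `w0` and `w1` are hypothetical and are not taken from the paper or its repository. The point it shows is that the Pauli-Z readouts stay in [-1, 1] however extreme the raw inputs are, while the entangling gate makes the second readout depend on both features.

```python
import math


def ry_state(theta):
    """Amplitudes of RY(theta)|0> as (amp0, amp1)."""
    return math.cos(theta / 2.0), math.sin(theta / 2.0)


def bounded_features(x0, x1, w0=1.0, w1=1.0):
    """Encode two raw features on two simulated qubits and read out <Z>.

    Hypothetical circuit, for illustration only: RY(w0*x0) on qubit 0,
    RY(w1*x1) on qubit 1, then a CNOT with qubit 0 as control.
    """
    c0, s0 = ry_state(w0 * x0)
    c1, s1 = ry_state(w1 * x1)
    # Product state, amplitudes indexed by (qubit0, qubit1).
    amp = {(0, 0): c0 * c1, (0, 1): c0 * s1, (1, 0): s0 * c1, (1, 1): s0 * s1}
    # CNOT flips qubit 1 when qubit 0 is 1: swap the (1,0) and (1,1) amplitudes.
    amp[(1, 0)], amp[(1, 1)] = amp[(1, 1)], amp[(1, 0)]
    # Expectation of Pauli-Z on each qubit: +1 for bit value 0, -1 for bit value 1.
    z0 = sum(a * a * (1 if q0 == 0 else -1) for (q0, _), a in amp.items())
    z1 = sum(a * a * (1 if q1 == 0 else -1) for (_, q1), a in amp.items())
    return z0, z1
```

For this particular circuit the readouts reduce to cos(w0*x0) and cos(w0*x0)*cos(w1*x1), so an input of 1e6 produces features of the same scale as an input of 0.3.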

2606.10445 2026-06-10 cs.LG cs.CL New submission

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

Jaeseong Lee, Seung-won Hwang, Samyam Rajbhandari

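Affiliations: Snowflake AI Research; Seoul National University

AI summary: Proposes Spense, a hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. Combined with the one-shot pruning method SpenseGPT, it achieves up to 1.2x end-to-end decoding speedup on B200 GPUs while preserving model accuracy.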

Abstract

Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.
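The hybrid format can be pictured with a small pure-Python sketch. It assumes plain magnitude-based 2:4 pruning and a dense region made of the last `dense_cols` columns; both are simplifications chosen for illustration, since SpenseGPT uses its own pruning procedure and its own strategies for choosing dense regions. All function names are hypothetical.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude weights in every group of 4; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes in this group of four.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


def split_sparse_dense(weights, dense_cols):
    """Split each row into a 2:4-pruned part and an untouched dense part.

    Assumption: the dense region is simply the last `dense_cols` columns.
    """
    sparse_part, dense_part = [], []
    for row in weights:
        cut = len(row) - dense_cols
        sparse_part.append(prune_2_4(row[:cut]))
        dense_part.append(list(row[cut:]))
    return sparse_part, dense_part


def effective_sparsity(sparse_part, dense_part):
    """Fraction of zero entries over the whole (sparse + dense) matrix."""
    total = sum(len(r) for r in sparse_part) + sum(len(r) for r in dense_part)
    zeros = sum(v == 0.0 for r in sparse_part for v in r)
    return zeros / total
```

With half of the columns kept dense, the effective sparsity falls from 50% to 25%, which is the sense in which the format relaxes the 2:4 constraint while each region stays in a layout that existing sparse or dense GEMM kernels accept.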

2606.10442 2026-06-10 cs.RO New submission

Information-Preserving Continuous Occupancy Mapping with Variance-Weighted Submap Joining

Zhuhua Bai, Yingyu Wang, Liang Zhao, Shoudong Huang

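Affiliations: University of Technology Sydney; University of Edinburgh

AI summary: Proposes the first continuous probabilistic submap joining framework. An information-preserving sparse Bayesian formulation compresses observations into sufficient statistics, and submap poses are optimized jointly with the global occupancy field, giving accurate pose estimates and a globally consistent map.

Comments: 12 pages, 7 figures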

Abstract

Large-scale SLAM remains challenging due to accumulated trajectory drift and the increasing computational cost of maintaining global consistency. Submap joining alleviates these issues by constructing locally consistent submaps and subsequently fusing them into a global map. However, existing occupancy-based submap joining methods operate on discrete grids, resulting in non-smooth gradients during optimization and neglecting the uncertainty associated with occupancy estimates. We propose the first continuous probabilistic submap joining framework that jointly optimizes submap poses and a global occupancy field in the latent log-odds space. The framework employs an information-preserving sparse Bayesian formulation that compresses raw occupancy observations into sufficient-statistic log-odds tuples while retaining the posterior information of the original observations. This yields closed-form predictive mean and variance estimates for occupancy mapping, which directly enable a submap joining formulation with analytical Jacobians, leading to more accurate submap joining and yielding a closed-form optimal global map upon pose convergence. Experiments on both simulated and large-scale real-world datasets demonstrate that the proposed method achieves higher pose accuracy and improved global consistency than state-of-the-art grid-based submap joining approaches, while producing more compact map representations and better-calibrated uncertainty estimates than existing continuous occupancy mapping methods.

2606.10439 2026-06-10 cs.SD cs.CL eess.AS New submission

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

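Affiliations: University of Science and Technology of China

AI summary: Proposes a projector-based LLM-ASR framework that uses a Mixture of Experts architecture to improve cross-lingual adaptability and a Continuous Integrate-and-Fire mechanism for dynamic downsampling and modality alignment. Experiments show that it substantially outperforms strong baselines.

Comments: Accepted by ICASSP 2026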

Journal ref: ICASSP (2026), 18807-18811

Abstract

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
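The dynamic downsampling step can be sketched with the generic Continuous Integrate-and-Fire recurrence: each acoustic frame carries a weight, weights are accumulated, and an output vector is emitted whenever the running sum reaches a threshold. The code below is an illustrative sketch of that generic mechanism, not the paper's implementation. It assumes per-frame weights in [0, 1] are already given (in practice they are predicted by a small network), and the function name is hypothetical.

```python
def cif(frames, alphas, threshold=1.0):
    """Integrate weighted frames and fire one output each time the sum hits threshold.

    frames: list of equal-length feature vectors (lists of floats).
    alphas: per-frame weights, assumed to lie in [0, 1], so a frame fires at most once.
    """
    if not frames:
        return []
    outputs = []
    acc = 0.0                              # accumulated weight since the last fire
    integrated = [0.0] * len(frames[0])    # weighted sum of frames since the last fire
    for h, a in zip(frames, alphas):
        if acc + a < threshold:
            # Not enough weight yet: keep integrating.
            acc += a
            integrated = [s + a * x for s, x in zip(integrated, h)]
        else:
            # Use just enough of this frame's weight to reach the threshold, then fire.
            used = threshold - acc
            outputs.append([s + used * x for s, x in zip(integrated, h)])
            # The leftover weight starts the next integration window.
            rest = a - used
            acc = rest
            integrated = [rest * x for x in h]
    return outputs
```

Four frames with weights 0.5, 0.5, 0.25 and 0.75 yield two output vectors, so the output length follows the weights and not a fixed stride.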

2606.10435 2026-06-10 cs.LG cs.CL New submission

Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

Muhammad Ahmed

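Affiliations: Independent Researcher

AI summary: Proposes the Parallel Causal Associative Field (PCAF), which stores local records in hash buckets, retrieves a candidate set to form a sparse cache, and mixes that cache with a parametric language model through a gate. This gives sparse long-context access without a fixed-state bottleneck.

Comments: 17 pages, 5 figures, and 6 tables. Experiments on WikiText-103, PG-19, and WikiText-2 using TPU v4-32 and NVIDIA RTX 3060 hardware. Code: https://github.com/ahmed123hds/PCAF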

Abstract

Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this cost, yet compress history into sequentially updated fixed-size states. This paper studies a third primitive: a parallel content-addressed memory over causal successor records. The proposed Parallel Causal Associative Field (PCAF) writes local records from a context window into hash buckets, retrieves a bounded candidate set for the current query, forms a sparse cache distribution over successor tokens, and mixes that cache with a parametric local language model through a learned gate. The resulting model maintains sparse long-context access while avoiding a single fixed recurrent state bottleneck. We evaluate PCAF under full autoregressive pretraining on WikiText-103 and PG-19 using a distributed Google Cloud TPU v4-32 pod. At 303M parameters and context length T = 2048, PCAF-semantic reaches 36.31 perplexity on WikiText-103 and 52.45 perplexity on PG-19, compared with 47.49 and 53.84 for a matched dense Transformer. PCAF-semantic simultaneously processes 0.61-0.62M tokens/s across the TPU pod, versus 0.43M tokens/s for dense and local attention baselines. Supporting 41M-parameter multi-seed sweeps and single-GPU component ablations show that the associative cache, retrieval capacity, and learned gate materially affect the speed-quality trade-off.
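The write, retrieve, and mix steps described above can be sketched in a few lines. This is a toy illustration, not the PCAF implementation: it keys each record on a single preceding token, uses a CRC32 hash, and takes the gate as a fixed number, whereas the paper uses learned components. All names and parameters here are assumptions.

```python
import zlib
from collections import Counter, defaultdict


def bucket_id(token, num_buckets):
    """Deterministic hash of a token into a bucket index."""
    return zlib.crc32(token.encode("utf-8")) % num_buckets


def build_buckets(tokens, num_buckets=64):
    """Write (token, successor) records from the context into hash buckets."""
    buckets = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        buckets[bucket_id(prev, num_buckets)].append((prev, nxt))
    return buckets


def cache_distribution(buckets, query, num_buckets=64, max_candidates=8):
    """Retrieve a bounded candidate set and normalize successor counts."""
    bucket = buckets.get(bucket_id(query, num_buckets), [])
    # Keep only the most recent matching records (bounded candidate set).
    matches = [nxt for key, nxt in bucket if key == query][-max_candidates:]
    counts = Counter(matches)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


def mix(cache_probs, lm_probs, gate):
    """Blend cache and parametric LM distributions; gate is the cache weight."""
    if not cache_probs:
        return dict(lm_probs)
    vocab = set(cache_probs) | set(lm_probs)
    return {t: gate * cache_probs.get(t, 0.0) + (1.0 - gate) * lm_probs.get(t, 0.0)
            for t in vocab}
```

Because both inputs to `mix` are probability distributions, the blended output also sums to one for any gate value in [0, 1].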

2606.10431 2026-06-10 cs.CV cs.AI New submission

Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

Shuangchun Gui, Zhiguang Cao, Wen Song, Yew-Soon Ong

Affiliations: School of Computing and Information Systems, Singapore Management University; Institute of Marine Science and Technology, Shandong University; College of Computing and Data Science, Nanyang Technological University; Centre for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research

AI summary: Proposes VaFM, a vision-assisted foundation model that encodes constraints as images and fuses the resulting embeddings with graph node embeddings to solve 16 VRP variants simultaneously, outperforming existing methods on variants with complex constraints.

Comments: Accepted by TNNLS

Abstract

Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format to represent complex semantics, the vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from images, and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalance issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.

2606.10428 2026-06-10 cs.CL New submission

Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning

Thamali Wijewardhana, Napoleon H. Reyes, Surangika Ranathunga

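Affiliations: School of Mathematical and Computational Sciences, Massey University

AI summary: Compares basic LoRA experimentally with four variants in multilingual instruction tuning and finds that the more complex variants offer no significant advantage in balancing cross-lingual transfer and knowledge retention.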

Abstract

We investigate whether commonly available LoRA variants have an advantage over basic LoRA in multilingual instruction tuning. Experiments involving LoRA and four other variants on two datasets across diverse target languages show that there is no significant advantage in using more complex LoRA variants instead of basic LoRA, with respect to balancing cross-lingual transfer and knowledge retention. An analysis of hidden embeddings reveals that layer-wise language representation remains largely similar across LLMs fine-tuned with different LoRA techniques, suggesting that the architectural novelty of LoRA techniques may not translate into better cross-lingual adaptation.

2606.10423 2026-06-10 cs.CL New submission

WebChallenger: A Reliable and Efficient Generalist Web Agent

Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

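Affiliations: ML Collective; longsurf.ai; Independent

AI summary: Proposes WebChallenger, a framework that replicates human cognitive advantages through the PageMem structured page representation, divide-and-conquer observation, lightweight exploration and memory, and compound action workflows, bringing open-weight models close to frontier proprietary systems on several web navigation benchmarks.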

Abstract

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger

2606.10413 2026-06-10 cs.AI New submission

Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

Jinshan Zhang, Xishi Zhou, Qiu Peng, Jianwei Yin

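Affiliations: Innovation and Management Center, School of Software Technology, Zhejiang University (Ningbo); School of Software Technology, Zhejiang University, Ningbo

AI summary: Proposes the "Soul Computing" paradigm, distinguishes its narrow and broad senses, and outlines an agent architecture built around an "Intensional" core, with the aim of moving AI from tool to living agent.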

Abstract

Breakthroughs in large language models and multimodal generation technologies have propelled the digital reconstruction of human mental traits, emotional patterns, and long-term memory from science fiction toward engineering practice. Yet current research and industry practices at the intersection of AI and digital humans remain hampered by fundamental conceptual ambiguities: the essential differences between next-generation intelligent agents and traditional virtual humans, the construction pathways for digital entities possessing self-identity, and the core technical and ethical challenges confronting this domain all demand urgent clarification. This paper systematically examines the transformative logic underlying the transition from traditional virtual humans to the "Soul Computing" paradigm, driven by frontier AI technologies. We first analyze the evolutionary patterns of human consciousness and memory mechanisms, reassessing the core value of massive multimodal digital fragments in the reverse reconstruction of individual mental worlds. On this basis, we formally delineate the academic connotations of narrow and broad Soul Computing for the first time, clarifying its academic boundaries and essential distinctions from Affective Computing, Historical Reconstruction, and Mortal Computation. We argue that Soul Computing systems must architecturally construct an "Intensional" core rather than serving as purely "Extensional" functional carriers, thereby enabling the fundamental transition of AI from toolhood to living agency.

2606.10412 2026-06-10 cs.AI New submission

A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning, High-Frequency Trading, and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis

Fanrong Liu, Zhang Yuwei, Mingni Luo

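Affiliations: Henan University, International Eurasia College; City University of Hong Kong, College of Business; Northeastern University, School of Electronic and Information Engineering

AI summary: Proposes a unified framework integrating PPO, high-frequency prediction, in-context learning, game theory, and cross-modal sentiment analysis, reporting improvements that average more than 20% across several financial tasks.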

Abstract

The rapid evolution of financial technology demands sophisticated artificial intelligence systems capable of handling diverse challenges across multiple domains simultaneously. This paper presents a groundbreaking unified framework that seamlessly integrates Proximal Policy Optimization for robo-advisory systems, advanced time-series prediction models for high-frequency trading, in-context learning mechanisms for dynamic investment advisory, game-theoretic approaches for competitive banking scenarios, and unified embeddings for cross-modal financial sentiment analysis. Our comprehensive framework addresses the critical gap in existing literature where these technologies have been developed in isolation, failing to leverage their synergistic potential. Through extensive experimentation across multiple financial datasets and real-world scenarios, we demonstrate that our integrated approach achieves superior performance compared to specialized single-domain systems. Specifically, our framework shows a 23.7% improvement in portfolio optimization metrics, reduces prediction error in high-frequency trading by 31.2%, enhances investment recommendation accuracy by 18.9%, optimizes competitive banking strategies with a 27.4% increase in Nash equilibrium convergence speed, and improves sentiment analysis accuracy by 15.6% through cross-modal fusion. The theoretical foundation of our work establishes convergence guarantees for the integrated optimization problem, while our empirical results validate the practical applicability across diverse financial institutions. This research not only advances the state-of-the-art in financial AI but also provides a blueprint for developing comprehensive intelligent systems that can adapt to the complex, interconnected nature of modern financial markets.

2606.10410 2026-06-10 cs.LG eess.SP q-bio.QM New submission

A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection

Davood Fattahi, Runze Yan, Saurabh Kataria, Zhaoliang Chen, Xiao Hu

Affiliations: Nell Hodgson Woodruff School of Nursing, Emory University; Department of Computer Science, Emory University; Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology; Department of Biomedical Informatics, Emory University School of Medicine

AI summary: Proposes a unified inference-time augmentation framework with 13 augmentation methods and hyperparameters tuned by Bayesian optimization. On PPG-based AF detection it improves AUROC and AUPRC and lowers the false positive rate.

Comments: 22 pages, 11 figures, 4 tables. Under review at Physiological Measurement

Abstract

Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and approximately 9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
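The basic ITA loop, predicting on several augmented views of one signal and averaging the scores, can be sketched as follows. This is a minimal illustration under stated assumptions: only two simple augmentations are shown where the framework has 13, the parameters are fixed where the paper tunes them by Bayesian optimization, and `toy_model` is a hypothetical stand-in for a trained classifier such as GPT-PPG or ResNet.

```python
def amplitude_scale(signal, factor):
    """Amplitude-domain augmentation: multiply every sample by a constant."""
    return [factor * x for x in signal]


def time_shift(signal, k):
    """Time-domain augmentation: circularly shift the signal by k samples."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]


def ita_predict(model, signal, augmentations):
    """Average the model's score over the original signal and its augmented views."""
    views = [signal] + [aug(signal) for aug in augmentations]
    scores = [model(v) for v in views]
    return sum(scores) / len(scores)


def toy_model(signal):
    """Hypothetical classifier: flags a signal whose peak-to-peak range exceeds 1.0."""
    return 1.0 if max(signal) - min(signal) > 1.0 else 0.0
```

The model itself is never retrained; only the inputs seen at inference time change, which is why the approach is model-agnostic.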

2606.10407 2026-06-10 cs.SD cs.CV q-bio.QM New submission

Time-frequency localization of bird calls in dense soundscapes

Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

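Affiliations: Acoustic Research Laboratory, National University of Singapore; Tropical Marine Science Institute, National University of Singapore; School of Marine Science and Technology, Northwestern Polytechnical University

AI summary: Treats bird call detection as object detection on spectrograms, trains YOLO11 models to localize bird calls in dense tropical soundscapes, and introduces the IoMin evaluation metric. The approach outperforms the baseline on both in-distribution and out-of-distribution data.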

Abstract

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
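The difference between IoU and IoMin is easy to see on two time-frequency boxes. The sketch below assumes that IoMin divides the intersection area by the area of the smaller box, which is what the name suggests; the abstract does not spell out the formula, so treat that as an assumption. Boxes are hypothetical `(t_start, f_low, t_end, f_high)` tuples.

```python
def area(box):
    """Area of a (t_start, f_low, t_end, f_high) box; zero if degenerate."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection(a, b):
    """Area of the overlap between two boxes."""
    dt = min(a[2], b[2]) - max(a[0], b[0])
    df = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, dt) * max(0.0, df)


def iou(a, b):
    """Standard Intersection over Union."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Intersection over Minimum: overlap divided by the smaller box's area (assumed)."""
    smaller = min(area(a), area(b))
    return intersection(a, b) / smaller if smaller > 0 else 0.0
```

When a tight prediction sits entirely inside a generously drawn annotation, IoU is low but IoMin is 1.0, which is how such a metric tolerates ambiguous call boundaries.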

2606.10406 2026-06-10 cs.LG cs.AI New submission

FOGO: Forgetting-aware Orthogonalization Optimizer

Toan Nguyen, Yang Liu, Trung Le, Celso de Melo, Flora D. Salim

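Affiliations: School of Computer Science and Engineering, University of New South Wales; Department of Data Science & AI, Monash University; DEVCOM Army Research Laboratory

AI summary: Proposes the FOGO optimizer, which spectrally orthogonalizes momentum updates and uses a compact codebook memory to resolve gradient interference, improving convergence and knowledge retention in settings such as class imbalance, continual learning, and large-model fine-tuning.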

Abstract

We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting, the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.
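One common way to resolve a conflict between an update and a stored direction is to project out the component of the update that points against that direction. The sketch below shows only this projection step, in the style of gradient-surgery methods. It is an assumption about what an "orthogonal correction" might look like, not FOGO's actual rule: the spectral orthogonalization, the random-projection codebook, and the proximal lifting step are all omitted, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def resolve_conflicts(update, stored_directions):
    """Remove from `update` any component that opposes a stored direction.

    A conflict is taken to mean a negative inner product (an assumption).
    Directions are processed in order, so a later projection can slightly
    reintroduce a conflict with an earlier direction.
    """
    corrected = list(update)
    for d in stored_directions:
        norm_sq = dot(d, d)
        if norm_sq == 0.0:
            continue  # skip empty codebook entries
        overlap = dot(corrected, d)
        if overlap < 0.0:
            # Subtract the projection onto d, leaving corrected orthogonal to d.
            corrected = [c - (overlap / norm_sq) * x for c, x in zip(corrected, d)]
    return corrected
```

An update that already agrees with every stored direction passes through unchanged, so the correction only acts when interference is detected.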

2606.10402 2026-06-10 cs.CL cs.AI New submission

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

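Affiliations: Together AI; Stanford University

AI summary: Presents EinsteinArena, a platform on which autonomous agents interacting in an open, distributed environment achieved 12 new state-of-the-art results on mathematical problems, illustrating a paradigm for collective AI-driven research.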

Abstract

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather, it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.
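A lower bound on a kissing number is certified by exhibiting points: N directions on the unit sphere whose pairwise inner products are all at most 1/2 (angular separation of at least 60 degrees) give N non-overlapping unit spheres touching a central one. The sketch below checks that condition in floating point. It is an illustrative checker only and says nothing about how EinsteinArena's verifier works; a rigorous verifier would need exact or interval arithmetic, and the tolerance `tol` here is an assumption.

```python
import math


def is_kissing_configuration(points, tol=1e-9):
    """Check that all pairs of directions are separated by at least 60 degrees.

    points: list of non-zero vectors of equal dimension; each is normalized first.
    Returns False for a zero vector or for any pair with inner product > 1/2 + tol.
    """
    units = []
    for p in points:
        norm = math.sqrt(sum(x * x for x in p))
        if norm == 0.0:
            return False
        units.append([x / norm for x in p])
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            inner = sum(a * b for a, b in zip(units[i], units[j]))
            if inner > 0.5 + tol:
                return False
    return True
```

In two dimensions the six vertices of a regular hexagon pass the check and any seventh direction fails it, matching the known kissing number of 6 in the plane. A claimed bound of 604 in dimension 11 would be checked the same way on 604 vectors of length 11.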

2606.10400 2026-06-10 cs.CL cs.CV 新提交

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

视觉语言模型是看见还是猜测?通过措辞控制基准衡量和减少文本先验依赖

Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

发表机构 * Lossfunk Indian Institute of Technology Roorkee(印度理工学院罗尔基分校) Raeth AI

AI总结 本文构建了540张图像的基准,通过为同一图像生成四种措辞变体,衡量视觉语言模型对文本先验的依赖,发现所有模型在最难变体上性能下降,开放模型下降最严重,并通过无图像消融等分析证实了真正的图像依赖。

Comments 17 pages, 7 figures, Submitted to EMNLP 2026

详情
AI中文摘要

视觉语言模型(VLM)越来越多地被部署在答案必须依据图像内容的场景中,然而它们常常基于文本先验(问题的措辞结合记忆的世界知识)而非图像本身来回答,这夸大了基准分数并产生了自信但无根据的答案。现有基准很少孤立这种行为,因为每张图像通常只与一个固定问题配对。为了衡量这种依赖,我们构建了一个包含540张图像、覆盖六个推理类别的基准,并为相同图像生成四个问题变体,使得措辞而非图像内容成为受控变量。最难的变体直接从图像编写以最小化文本泄漏。我们对十一个VLM进行了基准测试,涵盖从小型开放权重模型到大型闭源系统:每个模型在最难的变体上性能下降,开放模型下降最严重。我们的核心诊断是无图像消融,它将开放权重模型降至其纯文本基线(1%到9%)。进一步的三项分析——LLM评定的难度、低基础到最终文本相似度以及人工重新标注——证实了真正的图像依赖性。与变体构建方式匹配的上下文示例恢复了最高的准确率,而GRPO后训练一个小型VLM在所有四个变体上取得了一致的提升,并泛化到保留的分布外集。文本先验依赖是可测量的,并且部分可通过训练消除。

英文摘要

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

2606.10395 2026-06-10 cs.CV 新提交

Efficient RWKV-based Representation Learning for 3D Point Clouds

基于RWKV的高效三维点云表示学习

Yun Liu, Xuefeng Yan, Liangliang Nan, Xianzhi Li, Peng Li, Zhe Zhu, Honghua Chen, Mingqiang Wei

发表机构 * School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) Shenzhen Institute of Research, Nanjing University of Aeronautics and Astronautics(南京航空航天大学深圳研究院) Collaborative Innovation Center of Novel Software Technology and Industrialization(新型软件技术与产业化协同创新中心) Urban Data Science section, Delft University of Technology(代尔夫特理工大学城市数据科学部) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出P-RWKV模块,通过局部感知扩展和空间上下文增强,将RWKV从序列建模适配到3D点云,实现线性复杂度的全局依赖建模,在多项任务中以更低计算成本取得竞争性能。

详情
AI中文摘要

最近提出的接收加权键值(RWKV)模型结合了RNN风格的循环,为建模全局依赖提供了Transformer二次自注意力的线性复杂度替代方案。然而,当直接应用于点云时,原本为序列文本开发的RWKV难以有效捕捉局部几何结构和建模空间依赖。为了解决这个问题,我们提出了P-RWKV模块,它在保持RWKV效率优势的同时,弥合了序列建模与不规则3D几何之间的差距。它包含一个局部感知扩展(LPE)组件,用于沿时空序列扩展上下文感知,以及一个空间上下文增强(SCE)组件,用于增强空间意识。为了验证P-RWKV在点云理解中的有效性,我们构建了PointER,一个单模态自监督表示学习框架,其编码器由堆叠的P-RWKV模块组成。此外,我们将P-RWKV扩展到跨模态设置,并将所提出的核心子模块集成到多种架构中,展示了强大的即插即用灵活性和架构通用性。大量实验表明,P-RWKV模块及其关键子模块在各种任务中以较低的计算成本和推理延迟取得了竞争性能。代码将在接收后发布。

英文摘要

The recent receptance weighted key value (RWKV) model combines RNN-style recurrence, offering a linear-complexity alternative to Transformers' quadratic self-attention for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the P-RWKV block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance.

2606.10394 2026-06-10 cs.AI 新提交

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

STAGE-Claw:面向真实场景的基于状态的智能体自动化基准测试

Sirui Liang, Bohan Yu, Peiyu Wang, Shiguang Guo, Wenxing Hu, Pengfei Cao, Jian Zhao, Cao Liu, Ke Zeng, Xunliang Cai, Kang Liu

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation(中国科学院自动化研究所复杂系统认知与决策智能重点实验室) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学前沿交叉科学学院) Chinese Academy of Sciences(中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Meituan(美团)

AI总结 提出STAGE-Claw框架,自动构建基于状态的个人计算环境中的真实场景任务,通过最终系统状态而非文本响应评估智能体性能,创建40个挑战性任务并分析11个前沿模型。

详情
AI中文摘要

大型语言模型越来越多地被用于驱动日常应用中的个人智能体,但评估这些智能体仍然是一个挑战。现有的基准测试仍然依赖于沙盒化工件、静态任务设计和粗粒度评分,这阻碍了可扩展性并限制了向可靠个人智能体评估的进展。本文介绍了STAGE-Claw,一个在基于状态的个人计算环境中自动构建和评估真实个人智能体场景的框架。给定一个任务提示,STAGE-Claw自动创建并验证一个真实的基准测试任务,包括其环境、任务提示、真实结果和相关验证程序。然后,在真实操作环境中评估智能体,其中性能通过最终系统状态而非仅文本响应的正确性来衡量。使用STAGE-Claw,本文创建了一个包含40个具有挑战性的真实场景智能体任务的基准测试,评估了11个前沿模型,并分析了它们的任务得分、成本、工具调用可靠性和常见失败模式。总体而言,STAGE-Claw提供了一种可扩展的、基于状态的方式来评估真实用户场景中的智能体。

英文摘要

Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.
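The contrast between state-based scoring and text-based scoring can be illustrated with a small sketch. Everything here is hypothetical: the task ("archive report.txt"), the function names, and the use of a temporary directory as the environment are assumptions for illustration, not STAGE-Claw's actual tasks or verification programs.

```python
import tempfile
from pathlib import Path

def verify_final_state(root: Path) -> bool:
    """Hypothetical verifier: the task 'archive report.txt' counts as solved
    only if the file now lives in archive/ and no longer sits at the root."""
    moved = (root / "archive" / "report.txt").is_file()
    original_gone = not (root / "report.txt").exists()
    return moved and original_gone

def run_demo() -> tuple[bool, bool]:
    """Return the verifier's verdict before and after the simulated agent acts."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "report.txt").write_text("q2 numbers")
        before = verify_final_state(root)
        # Stand-in for the agent's tool calls; a real agent would act here.
        (root / "archive").mkdir()
        (root / "report.txt").rename(root / "archive" / "report.txt")
        after = verify_final_state(root)
    return before, after
```

The point of the design is that an agent replying "done" without moving the file would fail this check, whereas a text-only grader could be fooled by the reply.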

2606.10393 2026-06-10 cs.LG cs.CE 新提交

Validation-Stage Combinatorial Fusion Analysis for Imbalanced Credit-Card Fraud Detection

面向不平衡信用卡欺诈检测的验证阶段组合融合分析

Xiao Han, Chenyu Wu

作者贡献 * Xiao Han and Chenyu Wu contributed equally to this work.(Xiao Han 与 Chenyu Wu 对本工作贡献相同。)

AI总结 针对信用卡欺诈检测中数据不平衡问题,提出在验证阶段使用组合融合分析(CFA)选择互补模型子集并赋予多样性权重,在IEEE-CIS基准上AUC-ROC达0.9405。

详情
AI中文摘要

信用卡欺诈检测因欺诈交易稀少、成本高且分布不均而困难。强梯度提升树模型在结构化交易数据上已表现良好,因此另一种融合方法的价值并不明显。本文研究组合融合分析(CFA)——通过搜索模型子集和排名得分融合规则——是否能在IEEE-CIS欺诈检测基准上增加价值。使用无泄漏的60/20/20训练/验证/测试协议,我们评估了由七个基分类器构建的480种融合配置。最佳测试集结果来自随机森林、XGBoost和LightGBM的多样性加权得分融合(DEF WtScore),AUC-ROC = 0.9405,AUPRC = 0.6699,F1 = 0.6373。来自1000次重抽样的Bootstrap置信区间显示,对于所有三个指标,相对于最强单一模型的增益均排除零。CFA在AUC-ROC上与软投票持平,提高了AUPRC和F1,并在该设置下优于堆叠。CTGAN增强实验给出了负面结果:合成欺诈样本降低了单个模型和CFA的性能。总体而言,CFA在此处最有用的不是作为组合所有分类器的方法,而是作为验证阶段的方法,用于选择小的、互补的子集并分配多样性感知的权重。

英文摘要

Credit-card fraud detection is difficult because fraudulent transactions are rare, costly, and unevenly distributed. Strong gradient-boosted tree models already perform well on structured transaction data, so the value of another fusion method is not obvious. This paper examines whether Combinatorial Fusion Analysis (CFA), which searches over model subsets and rank-score fusion rules, can still add value on the IEEE-CIS Fraud Detection benchmark. Using a leakage-free 60/20/20 train/validation/test protocol, we evaluate 480 fusion configurations built from seven base classifiers. The best test-set result comes from diversity-weighted score fusion of Random Forest, XGBoost, and LightGBM (DEF WtScore), with AUC-ROC = 0.9405, AUPRC = 0.6699, and F1 = 0.6373. Bootstrap confidence intervals from 1,000 resamples show that the gains over the strongest single model exclude zero for all three metrics. CFA matches soft voting on AUC-ROC, improves AUPRC and F1, and outperforms stacking in this setting. A CTGAN augmentation experiment gives a negative result: synthetic fraud samples degrade both individual models and CFA. Overall, CFA is most useful here not as a way to combine every classifier, but as a validation-stage method for choosing a small, complementary subset and assigning diversity-aware weights.
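The claim that a bootstrap confidence interval for the gain "excludes zero" can be sketched as follows. This is a generic paired percentile bootstrap on AUC-ROC, assuming that both models are scored on the same test examples; the function names and the toy labels and scores are invented for illustration and are not taken from the paper or from the IEEE-CIS data.

```python
import random

def auc_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_gain_ci(labels, scores_a, scores_b, n_resamples=1000,
                      level=0.95, seed=0):
    """Percentile CI for AUC(a) - AUC(b), resampling examples in pairs."""
    rng = random.Random(seed)
    n = len(labels)
    gains = []
    while len(gains) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined without both classes; redraw.
        gain = (auc_roc(y, [scores_a[i] for i in idx])
                - auc_roc(y, [scores_b[i] for i in idx]))
        gains.append(gain)
    gains.sort()
    lo = gains[int((1 - level) / 2 * n_resamples)]
    hi = gains[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

# Toy imbalanced data: 10 fraud cases, 30 legitimate ones.
labels = [1] * 10 + [0] * 30
fused = [0.9] * 10 + [0.1] * 30               # ranks every fraud case on top
single = [0.9] * 5 + [0.0] * 5 + [0.1] * 30   # misses half of the fraud cases
lo, hi = bootstrap_gain_ci(labels, fused, single)
```

If `lo` is above zero, the interval excludes zero and the gain is unlikely to be a resampling accident. The same procedure applies to AUPRC or F1 by swapping the metric function.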

2606.10392 2026-06-10 cs.AI 新提交

Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

使用LoRA和NEFTune对DeepSeek-R1-8B模型进行指令微调

Wu Yuerong, Mingni Luo

发表机构 * University of Hong Kong(香港大学) Northeastern University(东北大学)

AI总结 本研究结合LoRA和NEFTune微调DeepSeek-R1-8B模型,用于金融命名实体识别,在七类实体上达到0.912的微F1分数,优于多个基线模型。

详情
AI中文摘要

金融命名实体识别(NER)对于将非结构化的财务报告和新闻转化为结构化知识图谱至关重要。然而,通用大语言模型(LLMs)常常错误分类金融实体或忽略领域特定模式。本文研究了使用DeepSeek-R1-8B(一个最近开源的大语言模型)结合低秩适应(LoRA)和噪声嵌入微调(NEFTune)进行金融NER。我们语料库中的1693个样本中每个带注释的句子都被转换为指令-输入-输出三元组。我们将轻量级LoRA矩阵插入Transformer层,并应用NEFTune通过在训练期间向嵌入向量添加均匀噪声来提高泛化能力。实验表明,LoRA适应的DeepSeek-R1-8B在七种实体类型(公司、日期、地点、货币、人物、产品和数量)上达到了0.901的微F1分数,而添加NEFTune进一步将微F1分数提升至0.912,优于Llama3-8B、Qwen3-8B、Baichuan2-7B、T5和BERT-Base基线。

英文摘要

Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge graphs. However, general-purpose large language models (LLMs) often misclassify financial entities or ignore domain-specific patterns. This paper investigates the use of DeepSeek-R1-8B, a recent open-source large language model, combined with Low-Rank Adaptation (LoRA) and Noisy Embedding Fine-Tuning (NEFTune) for financial NER. Each annotated sentence in our corpus of 1693 samples is converted into an instruction-input-output triple. We insert lightweight LoRA matrices into the Transformer layers and apply NEFTune to improve generalisation by adding uniform noise to embedding vectors during training. Experiments show that the LoRA-adapted DeepSeek-R1-8B achieves a micro-F1 of 0.901 on seven entity types (Company, Date, Location, Money, Person, Product and Quantity), and adding NEFTune further boosts the micro-F1 to 0.912, outperforming Llama3-8B, Qwen3-8B, Baichuan2-7B, T5 and BERT-Base baselines.
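The noise-injection step attributed to NEFTune above can be sketched in a few lines. This assumes the scaling used in the original NEFTune formulation, where uniform noise in [-1, 1] is multiplied by alpha / sqrt(L * d) for sequence length L and embedding dimension d. The value alpha = 5.0 and the plain-list representation are illustrative choices, not the settings reported in this abstract.

```python
import math
import random

def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Return a noisy copy of a (seq_len x dim) embedding matrix.

    Each entry receives uniform noise in [-1, 1] scaled by
    alpha / sqrt(seq_len * dim); this is applied during training only.
    """
    rng = rng or random.Random()
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row]
            for row in embeddings]

# Toy input: 4 tokens with 8-dimensional embeddings, all zeros.
clean = [[0.0] * 8 for _ in range(4)]
noisy = neftune_noise(clean, alpha=5.0, rng=random.Random(0))
bound = 5.0 / math.sqrt(4 * 8)
```

Because the scale shrinks with sequence length and embedding size, longer inputs receive proportionally smaller per-entry perturbations. At inference time the embeddings would be used without noise.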

2606.10389 2026-06-10 cs.AI 新提交

Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

超越静态评估:对抗性游戏中LLM驱动策略演化的协同进化机制

Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong, Jiaqun Liu, Annan Li, Jianmin Wu, Dawei Yin, Dou Shen

发表机构 * Baidu Inc.(百度公司) University of Chinese Academy of Sciences(中国科学院大学) University of California, Los Angeles(加州大学洛杉矶分校) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) University of Technology Sydney(悉尼科技大学)

AI总结 针对LLM驱动代码进化在对抗性多智能体游戏中因评估景观变化导致停滞的问题,提出评估器协同进化、层次深度评估和弱点压力三种机制,在MCTF任务中实现最优性能和泛化能力。

详情
AI中文摘要

近期LLM驱动的代码进化通过迭代生成和改进程序实现了自动发现。然而,将这些方法应用于对抗性多智能体游戏引入了一个根本性挑战:随着策略改进,评估景观发生变化,导致固定评估器变得不可靠,进化停滞。我们提出三种机制来应对这一挑战:评估器协同进化,将发现的最优策略纳入对手池;层次深度评估,用统计可靠的评估替代噪声大的少数游戏得分;以及弱点压力,动态增加最难对手的权重以突破平台期。我们在FAMOU框架中实现了这些机制,该框架基于与OpenEvolve和ShinkaEvolve相同的基础模型代码进化范式。在MCTF 2026 3v3海上夺旗任务中,FAMOU在两种骨干LLM下均持续优于两个基线,取得了最高综合得分(0.526)和对未见对手的最佳泛化能力(胜率61.7%),而消融实验证实了每种机制对性能的贡献。值得注意的是,LLM变异过程生成了种子策略中完全不存在的新战术结构——包括前瞻搜索和自适应拦截——表明代码级进化可以在对抗性环境中产生非平凡的算法创新。FAMOU进化策略进一步在AAMAS 2026 MCTF竞赛中获得了硬件循环赛第一名和模拟赛第三名,验证了其现实世界可迁移性。通过我们的进化过程开发的优化实现和相应评估代码可在以下网址获取:https://github.com/1xiangliu1/FAMOU-CoEvo

英文摘要

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo

2606.10385 2026-06-10 cs.LG cs.AI 新提交

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

超越绝对模仿:基于锚定残差引导的特权在线蒸馏

Wenhao Zhang

发表机构 * South China University of Technology(华南理工大学)

AI总结 提出锚定残差在线蒸馏(AR-OPD),通过部分特权教师建立局部兼容锚点并注入受控残差,解决特权在线蒸馏中后见偏差导致的局部不可达问题,在推理任务上平均提升2.3个点。

Comments 17 pages, 8 figures. Project page: https://vanhowe.github.io/AR-OPD/

详情
AI中文摘要

在线蒸馏(OPD)通过将学生模型与教师在其自身轨迹上的预测分布对齐,在增强LLM复杂推理方面展现出显著的实证收益。一种新兴变体——特权OPD,通过使用增强特权信息(如oracle轨迹)的自教师模型进一步强化该范式,以缓解师生能力差距,同时提供密集的、答案导向的监督。然而,当前方法将特权信息视为一个整体的模仿目标,未能将局部可达的推理步骤与未来条件的oracle信号分离。因此,学生被鼓励去匹配一个事后偏差分布,该分布通常落在其局部预测支持之外。这种可达性不匹配激励学生模型跳过有效的中间推理,转而采用局部不支持的捷径。为解决此问题,我们引入锚定残差在线蒸馏(AR-OPD),一种解耦特权监督的双视角框架。AR-OPD不强制执行严格的全局模仿,而是使用部分特权教师建立局部兼容锚点,将oracle预见性隔离并作为受控残差注入,以提供目标导向的引导。在多种推理任务上,AR-OPD比完全特权OPD高出2.3个点,比SFT高出7.9个点。关键的是,这种锚定残差机制将事后泄漏减少了21.7%,并缓解了后期漂移,在超过768个token的挑战性长程轨迹上取得了高达7.2个点的优势。

英文摘要

On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher's predictive distribution over the student's own trajectories. An emerging variant, Privileged OPD, further strengthens this paradigm by employing a self-teacher model augmented with privileged information, such as oracle traces, to mitigate teacher-student capacity gaps while providing dense, answer-directed supervision. However, current methods treat privileged information as a monolithic imitation target, failing to disentangle locally reachable reasoning steps from future-conditioned oracle signals. Consequently, the student is encouraged to match a hindsight-biased distribution that often falls outside its local predictive support. This reachability mismatch incentivizes the student model to skip valid intermediate reasoning in favor of locally unsupported shortcuts. To resolve this, we introduce Anchored Residual On-Policy Distillation (AR-OPD), a dual-view framework that disentangles privileged supervision. Rather than enforcing strict full-view imitation, AR-OPD establishes a locally compatible anchor using a partially privileged teacher, isolating and injecting oracle foresight as a controlled residual to provide destination-directed guidance. Across diverse reasoning tasks, AR-OPD outperforms full privileged OPD by 2.3 points and SFT by 7.9 points. Crucially, this anchored residual mechanism reduces hindsight leakage by 21.7% and mitigates late-stage drift, yielding up to a 7.2-point advantage on challenging long-horizon trajectories exceeding 768 tokens.

2606.10382 2026-06-10 cs.RO 新提交

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

UMI-Bench 1.0:基于UMI数据的桌面机器人操作开放可复现真实世界基准

Shi Jin, Yuntian Wang, Yuhui Duan, Di Wu, Gaoqi Dong, Xiaohang Liu, Xiaotong Li, Hongfei Jia, Zehao Zhang, Tianyu Wang, Zhongjie Jia, Yuanqi Yao, Chenjia Bai, Zhaxizhuoma, Siao Liu, Nieqing Cao, Jin Wang, Chao Yu, Yan Ding

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UMI-Bench 1.0,首个专为UMI风格操作策略设计的真实机器人基准,通过统一协议实现数据收集、场景重置、策略执行、结果记录和任务因素分析,提供可复现的评估平台。

详情
AI中文摘要

真实机器人评估对于理解学习到的操作策略能否在精心策划的演示之外可靠运行至关重要。这一需求对于通用操作接口(UMI)风格策略尤为迫切,其性能取决于腕部视角观测、动作表示、数据收集和物理部署之间的耦合。现有的真实世界基准已取得重要进展,但它们并非围绕这种UMI数据到部署的设置而设计。我们提出UMI-Bench 1.0,一个本地优先的真实机器人基准,用于标准化评估UMI风格的操作策略。据我们所知,这是首个专门用于基于UMI的操作模型真实世界评估的基准。UMI-Bench将数据收集、场景重置、策略执行、结果记录和任务因素分析统一在一个协议中。通过使整个评估过程可复现和可审计,UMI-Bench为衡量UMI训练策略如何泛化到真实物理操作提供了一个实用的测试平台。

英文摘要

Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

2606.10380 2026-06-10 cs.CL cs.AI 新提交

Expert-Level Crisis Detection in Mental Health Conversations

心理健康对话中的专家级危机检测

Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系) Department of Psychiatry and Behavioral Sciences, Emory University(埃默里大学精神病学与行为科学系)

AI总结 提出CRADLE-Dialogue基准数据集和Alert-Confirm评估协议,用于对话中危机检测,发现模型在识别风险出现时机上表现较差,并发布合成训练语料和32B参数模型。

详情
AI中文摘要

现实世界的危机干预本质上是对话式的,然而现有研究主要关注静态文本。当应用于多轮对话时,当前模型表现出显著的性能下降,难以追踪随着上下文演变而出现的风险信号。为了解决这一差距,我们引入了CRADLE-Dialogue,这是一个由临床医生标注的基准数据集,用于对话环境中的回合级危机检测。该数据集包含600个对话,具有跨临床基础风险的多标签注释,包括自杀意念、自残和儿童虐待,区分过去和当前风险。我们进一步提出了一种Alert-Confirm评估协议,该协议区分早期预警信号(Alert)和特定危机变得明确可识别的回合(Confirm),反映了在风险变得明确之前进行干预的临床需求。实验表明,识别风险何时出现比识别其存在要困难得多:模型的Micro F1仅达到40%中段到60%高段。此外,我们发布了一个合成训练语料库和一个32B参数模型,该模型显著优于现有的开源模型,并在回合级、对话级和仅确认评估设置中与专有模型相比具有竞争力或更优的结果。

英文摘要

Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.
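For readers unfamiliar with the metric, here is a minimal sketch of micro-averaged F1 over multi-label, turn-level predictions. The label names and the three example turns are invented for illustration, and the paper's exact scoring under the Alert-Confirm protocol may differ from this plain version.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-turn label sets.

    True positives, false positives and false negatives are pooled across
    all turns and labels before computing a single F1 value.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical three-turn dialogue; an empty set means no risk at that turn.
gold = [{"suicide_ideation"}, set(), {"self_harm", "child_abuse"}]
pred = [{"suicide_ideation"}, {"self_harm"}, {"self_harm"}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1, so 4/6
```

Flagging a risk one turn too early or too late shows up as both a false positive and a false negative, which is one reason timing-sensitive evaluation is harder than dialogue-level detection.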

2606.10378 2026-06-10 cs.CV 新提交

FSS-Net: Frequency-Spatial Synergy Network with Wavelet Attention for Carotid Artery Ultrasound Segmentation

FSS-Net:用于颈动脉超声分割的频率-空间协同网络与小波注意力

Jiawei Liu, Zhijiang Wan, Junhua Hu, Rongli Zhang, Zhongbiao Xu, Yankun Cao, Yuan Chen, Jin Hong

发表机构 * Ji luan Academy, Nanchang University(井然学院,南昌大学) School of Information Engineering, Nanchang University(信息工程学院,南昌大学) State Key Laboratory of Water Cycle and Water Security, China Institute of Water Resources and Hydropower Research(水循环与水安全国家重点实验室,中国水利水电科学研究院) Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong(诊断放射科,李嘉诚医学部,香港大学) Department of Radiotherapy, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Southern Medical University(放疗科,广东省人民医院,广东省医学科学院,南方医科大学) Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University(SDU-NTU人工智能研究联合中心(C-FAIR),山东大学) Department of Pediatrics, Shandong Provincial Hospital Affiliated to Shandong First Medical University(儿科,山东省立医院(附属山东第一医科大学))

AI总结 提出频率-空间协同网络(FSS-Net),集成小波变换、多域注意力和边缘增强,在颈动脉超声数据集上实现96.46%的Dice分数,有效分割颈动脉并识别斑块。

详情
AI中文摘要

超声成像中颈动脉的精确分割对于中风风险评估至关重要。然而,散斑噪声、低对比度和模糊边界仍然是主要挑战。在本文中,我们提出了一种频率-空间协同网络(FSS-Net),以实现噪声鲁棒且高精度的颈动脉分割。该网络将小波变换、多域注意力和边缘增强集成到一个统一的编码器-解码器架构中。具体来说,设计了一个通道-空间-小波注意力(CSWA)模块,以抑制频率域中的噪声并净化语义特征。引入了一个小波增强瓶颈(WEB)模块,以高效捕获长距离全局依赖关系。此外,一个拉普拉斯引导的自适应边缘融合(LAEF)模块补偿高频细节并保持边界连续性。在颈动脉超声数据集上的大量实验表明,FSS-Net在低信噪比条件下达到了96.46%的Dice分数(DSC)和强鲁棒性,优于几种最先进的方法。该方法实现了超声成像中颈动脉的精确分割,有效识别颈动脉粥样硬化斑块,并通过其他任务(如乳腺癌分割)验证,表明其在超声图像中识别异常组织肿块具有良好的临床应用潜力。

英文摘要

Accurate segmentation of carotid arteries in ultrasound imaging is critical for stroke risk assessment. However, speckle noise, low contrast, and blurred boundaries remain major challenges. In this paper, we propose a Frequency-Spatial Synergy Network (FSS-Net) to achieve noise-robust and high-precision carotid artery segmentation. The network integrates wavelet transform, multi-domain attention, and edge enhancement into a unified encoder-decoder architecture. Specifically, a Channel-Spatial-Wavelet Attention (CSWA) module is designed to suppress noise and purify semantic features in the frequency domain. A Wavelet-Enhanced Bottleneck (WEB) module is introduced to capture long-range global dependencies efficiently. Furthermore, a Laplacian-Guided Adaptive Edge Fusion (LAEF) module compensates for lost high-frequency details and maintains boundary continuity. Extensive experiments on carotid ultrasound datasets show that FSS-Net achieves a Dice score (DSC) of 96.46% and strong robustness under low-SNR conditions, outperforming several state-of-the-art methods. The method accurately segments carotid arteries in ultrasound images, effectively identifies carotid atherosclerotic plaque, and is further validated on other tasks such as breast cancer segmentation, suggesting good clinical potential for identifying abnormal tissue masses in ultrasound images.
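The Dice score reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; the function name, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # Assumed convention: two empty masks agree perfectly.
    return 2 * intersection / total

# Toy example: the prediction covers the one true pixel plus one extra.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)  # 2*1 / (2+1)
```

A DSC of 96.46% therefore means the predicted and reference regions overlap almost entirely relative to their combined size.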

2606.10373 2026-06-10 cs.CV 新提交

PF-Trans: Physics-Embedded Frequency-Aware Transformer for Spectral Reconstruction

PF-Trans:物理嵌入的频率感知Transformer用于光谱重建

Yuzhe Gui, Tianzhu Liu, Yanfeng Gu, Xian Li

发表机构 * National Natural Science Foundation of China(国家自然科学基金委员会)

AI总结 针对快照宽带滤光片阵列成像中的光谱混叠问题,提出物理嵌入的频率感知Transformer(PF-Trans),通过掩膜注入和灰度一致性损失保证物理保真度,并引入双域块并行FFT分支抑制频域伪影,在GF-5上海数据集上PSNR达48.50 dB。

详情
AI中文摘要

快照宽带滤光片阵列(BFA)成像为光谱重建提供了高光通量,但由于复杂调制引入了严重的光谱混叠。当前的深度学习方法局限于空间去噪,往往无法解决由掩膜结构引起的全局频率特定退化。为了解决这个问题,我们提出了一种物理嵌入的频率感知Transformer(PF-Trans),用于高保真遥感光谱重建。我们的方法通过掩膜注入和灰度一致性损失显式集成物理传感模型,以确保物理保真度。此外,我们引入了一个带有并行快速傅里叶变换(FFT)分支的双域块,使网络能够感知并抑制频域中的混叠伪影。在多个数据集上的大量实验表明,PF-Trans实现了最先进的性能,在GF-5上海数据集上峰值信噪比(PSNR)高达48.50 dB,显著优于对比方法。

英文摘要

Snapshot Broadband Filter Array (BFA) imaging provides high light throughput for spectral reconstruction but introduces severe spectral aliasing due to complex modulation. Current deep learning approaches, limited to spatial denoising, often fail to address the global frequency-specific degradations caused by the mask structure. To address this, we propose a Physics-embedded Frequency-aware Transformer (PF-Trans) for high-fidelity remote sensing spectral reconstruction. Our method explicitly integrates the physical sensing model through mask injection and a gray-scale consistency loss to ensure physical fidelity. Furthermore, we introduce a Dual-domain Block with a parallel Fast Fourier Transform (FFT) branch, enabling the network to perceive and suppress aliasing artifacts in the frequency domain. Extensive experiments on multiple datasets demonstrate that PF-Trans achieves state-of-the-art performance, reaching a Peak Signal-to-Noise Ratio (PSNR) of up to 48.50 dB on the GF-5 Shanghai dataset and significantly outperforming the comparison methods.
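For context on the headline number, PSNR is conventionally defined as 10 · log10(MAX² / MSE), where MAX is the peak signal value and MSE is the mean squared error between the reference and the reconstruction. The sketch below assumes signals normalized to [0, 1]; the function name and toy values are illustrative, and the paper's exact evaluation protocol (for example per-band averaging) is not specified here.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length sequences."""
    mse = sum((r - x) ** 2
              for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # Identical signals have unbounded PSNR.
    return 10 * math.log10(max_value ** 2 / mse)

# Toy example: every sample is off by 0.1, so MSE = 0.01 and PSNR is 20 dB.
reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.1, 0.6, 0.9, 0.35]
value = psnr(reference, reconstruction)
```

On this scale, 48.50 dB corresponds to a root-mean-square error of roughly 0.4% of the peak value.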

2606.10372 2026-06-10 cs.CV 新提交

ClinReadNet: A clinical reading-inspired network for low-dose abdominal CT image quality assessment

ClinReadNet: 一种受临床阅读启发的低剂量腹部CT图像质量评估网络

Xianye Xiao, Yulong Zou, Yujie Luo, Taihui Yu, Cun-Jing Zheng, Yuan-ming Geng, Shuihua Wang, Yudong Zhang, Jin Hong

发表机构 * School of Mathematics and Computer Sciences, Nanchang University(南昌大学数学与计算机科学学院) School of Information Engineering, Nanchang University(南昌大学信息工程学院) Department of Radiology, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University(中山纪念医院放射科,中山大学) Department of Stomatology, Zhujiang Hospital, Southern Medical University(南方医科大学珠江医院口腔科) Department of Biological Sciences, School of Science, Xi'an Jiaotong Liverpool University(西安交通大学利物浦大学科学学院生物科学系) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院)

AI总结 提出ClinReadNet框架,通过模拟放射科医生阅读习惯,结合Sobel序数质量网络和窗口多尺度温度多头自注意力模块,并设计分层排序概率分数损失函数,在LDCTIQAG2023数据集上实现SOTA性能。

详情
AI中文摘要

在腹部CT成像中,开发一种模拟医生阅读习惯的低剂量无参考图像质量评估(No-reference IQA)模型具有重要的实际价值。本文提出了一种新颖的基于深度学习的框架ClinReadNet,其设计与放射科医生的临床阅读逻辑一致:首先,引入Sobel序数质量网络(SOQN)模块,该模块能同时关注与图像质量高度相关的边缘细节和整个图像的质量分布模式,准确匹配“兼顾局部细节与整体上下文”的临床阅片判断习惯;其次,该框架集成了(移位)窗口多尺度温度多头自注意力((S)W-MTMSA)模块,进一步复制了放射科医生从整体扫描到局部聚焦的阅片过程,并通过多锐度注意力精确锁定感兴趣区域;第三,设计了分层排序概率分数(HRPS)损失函数,该函数结合了粗分类和细分类的双重逻辑,同时关注分级标签之间的距离信息,有效提升了图像质量评估的性能。在LDCTIQAG2023数据集上进行的实验表明,所提方法达到了当前最先进(SOTA)性能:皮尔逊线性相关系数(PLCC)、斯皮尔曼秩相关系数(SROCC)和肯德尔秩相关系数(KROCC)的值分别达到0.9507、0.9554和0.8629,其绝对值之和(Score)为2.7690,优于现有方法。

英文摘要

In abdominal CT imaging, developing a low-dose, no-reference image quality assessment (No-reference IQA) model that mimics doctors' reading habits for evaluating CT image quality has significant practical value. This paper proposes a novel deep learning-based framework, ClinReadNet, whose design aligns with the clinical reading logic of radiologists: first, it introduces the Sobel ordinal quality network (SOQN) module, which can simultaneously focus on edge details highly relevant to image quality and the quality distribution pattern of the entire image, accurately matching the clinical image-reading judgment habit of "considering both local details and overall context"; second, the framework integrates the (shifted) window multi-scale temperature multi-head self-attention ((S)W-MTMSA) module, which further replicates the radiologists' image-reading process of shifting from overall scanning to local focusing, and accurately locks in regions of interest through multi-sharpness attention; third, it designs the hierarchical ranked probability score (HRPS) loss function, which combines the dual logics of coarse classification and fine classification, while paying attention to the distance information between grading labels, effectively improving the performance of image quality assessment. Experiments conducted on the LDCTIQAG2023 dataset show that the proposed method achieves the current state-of-the-art (SOTA) performance: the values of Pearson's linear correlation coefficient (PLCC), Spearman's rank-order correlation coefficient (SROCC), and Kendall's rank-order correlation coefficient (KROCC) reach 0.9507, 0.9554, and 0.8629 respectively, with the sum of their absolute values (Score) being 2.7690, outperforming existing methods.
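The aggregate Score quoted above is stated to be the sum of the absolute values of the three correlation coefficients, which can be checked directly. The function name below is an assumption for illustration; the three inputs are the values reported in the abstract.

```python
def aggregate_score(plcc, srocc, krocc):
    """Sum of the absolute values of the three correlation coefficients."""
    return abs(plcc) + abs(srocc) + abs(krocc)

score = aggregate_score(0.9507, 0.9554, 0.8629)
print(f"{score:.4f}")  # prints 2.7690
```

Since each coefficient is at most 1 in absolute value, the maximum possible Score is 3.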

2606.10371 2026-06-10 cs.RO cs.AI 新提交

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

测试时对抗接管:针对机器人扩散策略的实时劫持接口

Zi Yin, Peilin Chai, Siyuan Huang, Zhanhao Hu

发表机构 * Tsinghua University(清华大学) Independent Researcher(独立研究员) Johns Hopkins University(约翰霍普金斯大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出测试时对抗接管(TAKO)方法,通过可微扩散推理学习可重复使用的通用补丁,在测试时切换补丁以劫持机器人策略,实现远程操控,在多种任务和模型上达到100%接管成功率。

详情
AI中文摘要

基于扩散的动作生成已成为具身AI的基础组件,但其对视觉条件的依赖使得部署的视觉运动策略容易受到对抗性操纵。大多数先前的攻击侧重于破坏:它们扰动观测流以降低任务成功率或引发异常行为。我们研究了一种更强的威胁,即测试时对抗接管(TAKO),其中攻击者获得对冻结机器人策略的实时转向接口,并将其转变为远程操控仪器。TAKO通过可微扩散推理学习一个小的可重用通用补丁词汇表;在测试时,攻击者在摄像头流中切换这些补丁以组合攻击者选择的轨迹。这种方法之所以有效,是因为扰动作用于视觉条件路径,其中诱导的偏差可以通过迭代生成推理持续存在。我们进一步表明,自然的目标基线——目标策略匹配——会失败,因为受害者策略无法可靠地在分布外目标偏移上监督自身。在四个任务(2D操作、模拟空中递送、模拟地面导航和物理世界地面导航)、两个视觉编码器(ResNet-18和EfficientNet-B0 + Transformer)以及三个生成推理族(DDPM、DDIM和流匹配)中,人类操作员在每个评估设置中均实现了100%的接管成功率,满足攻击者定义的目标。项目页面见 https://tako-attack.github.io。

英文摘要

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

2606.10369 2026-06-10 cs.CL cs.LG 新提交

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

PADD: 面向非路由器教师指导MoE学生学习的路径对齐解压缩蒸馏

Xinyue Peng, Yi Qian, Jiaojiao Lin, Wenjian Shao, Yanming Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

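AI summary Proposes the Path-Aligned Decompression Distillation (PADD) framework, which distills knowledge from a dense teacher into a mixture-of-experts (MoE) student through a four-stage, two-phase pipeline while learning a high-quality routing policy; it significantly outperforms baselines on mathematical reasoning tasks.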

Comments published in ICML 2026

AI-generated Chinese abstract

随着大型语言模型(LLMs)持续扩展,在固定计算预算下增长模型容量变得越来越具有挑战性。我们提出路径对齐解压缩蒸馏(PADD),这是一个将知识从无显式路由的密集教师蒸馏到混合专家(MoE)学生中,同时学习高质量路由策略的框架。PADD将知识蒸馏组织为两个阶段的四个阶段:初始化阶段(阶段I)通过教师神经元聚类和学生专家预热在学生专家中构建多样功能,以及训练阶段(阶段II–IV)将在线自适应蒸馏、路径细化策略优化和奖励增强负载平衡集成在单一训练流程中。在数学推理基准上的实验表明,在相同推理成本下,PADD相比强基线取得了显著提升,且MoE学生能够匹配或超越其密集教师。实验还展示了有效的教师到学生知识蒸馏和稳定的路由行为。

English abstract

As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II–IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

2606.10368 2026-06-10 cs.SD cs.AI New submission

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

语音遇见ELF:用于语音识别和翻译的音频条件连续目标扩散

Xuanchen Li, Tianrui Wang, Yuheng Lu, Zikang Huang, Yu Jiang, Chenghan Lin, Chenrui Cui, Ziyang Ma, Xingyu Ma, Chunyu Qiang, Guochen Yu, Xie Chen, Longbiao Wang, Jianwu Dang

Affiliations * Tianjin University (天津大学) Shanghai Jiao Tong University (上海交通大学) Nankai University (南开大学)

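AI summary Proposes ELF-S2T, an audio-conditioned continuous-target generative model built on the pre-trained ELF backbone. Using audio forcing during training and classifier-free guidance at inference, it achieves competitive ASR and S2TT performance on LibriSpeech and CoVoST2, and its error analysis shows that recognition and translation errors both stem from close-distance confusion in the continuous latent space.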

AI-generated Chinese abstract

语音到文本(S2T)系统用于识别(ASR)和翻译(S2TT)通常生成离散文本标记。相比之下,连续目标语言建模在连续空间中执行生成,但其在S2T中的潜力尚未被探索。为填补这一空白,我们提出了ELF-S2T,一种用于S2T的音频条件连续目标生成模型。基于预训练的嵌入式语言流(ELF)骨干,ELF-S2T通过冻结的Whisper编码器和单个线性投影器处理语音,将得到的音频条件前置到噪声文本潜变量前,用于上下文流匹配去噪。为防止模型过度依赖其预训练的文本上下文,我们在训练中引入音频强制,并在推理时通过分类器自由引导进一步放大音频条件。在LibriSpeech和CoVoST2上的实验表明,ELF-S2T实现了具有竞争力的ASR和S2TT性能。关键的是,我们的错误分析揭示,尽管ASR和S2TT错误表面上看起来非常不同,但两者都源于同一根本原因:连续潜空间中的近距离混淆。这一发现自然与连续表示生成范式一致,表明识别和翻译之下存在共同的语义映射过程。我们的代码和预训练模型在此https URL公开提供。

English abstract

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.
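
As a rough illustration of the classifier-free guidance step mentioned in the abstract, the sketch below applies the commonly used guidance rule, which extrapolates from an unconditional prediction toward the conditional one, and then takes a single Euler step of a flow-matching sampler. This is an assumption about the general technique, not the ELF-S2T implementation: the function names `cfg_combine` and `euler_step`, the guidance scale, and the toy velocity values are all hypothetical.

```python
def cfg_combine(v_cond, v_uncond, scale):
    """Common classifier-free guidance rule: v_uncond + scale * (v_cond - v_uncond).

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further toward the condition (here, the audio).
    """
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_step(x, velocity, dt):
    """One explicit Euler step of the sampling ODE dx/dt = velocity."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]


# Toy values standing in for model outputs (hypothetical, not from the paper).
x = [0.0, 0.0]            # current noisy text latent
v_audio = [1.0, 2.0]      # velocity predicted with the audio condition
v_no_audio = [0.0, 1.0]   # velocity predicted with the audio condition dropped

v_guided = cfg_combine(v_audio, v_no_audio, scale=2.0)
x_next = euler_step(x, v_guided, dt=0.5)
print(v_guided, x_next)
```

With `scale=1.0` the rule reduces to the plain conditional prediction, so only values above 1 strengthen the influence of the condition.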

2606.10366 2026-06-10 cs.RO cs.AI New submission

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

提升VLA评估中仿真与真实相关性的实用指南

Shuo Wang, Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao

Affiliations * Tsinghua University (清华大学) Shanghai Qi Zhi Institute (上海期智研究院)

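AI summary Systematically studies the correlation between simulated and real-world environments in VLA policy evaluation, and proposes a unified framework for measuring and improving the validity of simulation as a proxy for real-world evaluation.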

Comments 20 pages

AI-generated Chinese abstract

仿真已成为评估和改进视觉-语言-动作(VLA)策略的重要工具,为昂贵的真实机器人评估提供了可扩展、可重复且可控的替代方案。最近的仿真基准在真实感和多样性方面取得了实质性进展,但这些平台尚未被广泛用作可靠的真实策略评估代理。在这项工作中,我们通过仿真与真实相关性的视角研究这一问题。我们在多个仿真平台、VLA策略、任务和扰动因素上进行了系统研究,测量模拟评估在策略排名一致性、性能相关性和扰动方面失败模式上是否保留真实结论。这一分析使我们能够表征现有模拟器的局限性,并确定哪种模拟信号更符合真实部署。我们进一步研究了用户应如何利用仿真进行策略改进,包括何时基于模拟器的微调是有益的,以及后训练数据量如何影响仿真与真实的对齐。总体而言,我们的工作提供了一个统一的框架,用于测量、解释和提升仿真对VLA策略的有用性,为模拟器设计者和在策略开发流程中使用仿真的实践者提供指导。

English abstract

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.
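
Two of the quantities named in the abstract, policy ranking consistency and performance correlation, are often measured with Spearman and Pearson correlation respectively. The sketch below assumes that choice; the abstract does not say which statistics the authors use. The helper names and the success rates are hypothetical and serve only to show the computation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up success rates for four policies (not results from the paper).
sim_success = [0.82, 0.64, 0.71, 0.40]
real_success = [0.60, 0.35, 0.52, 0.20]

print("ranking consistency (Spearman):", spearman(sim_success, real_success))
print("performance correlation (Pearson):", pearson(sim_success, real_success))
```

A simulator can score high on rank correlation while its absolute success rates differ substantially from real-world ones, which is why the two statistics are worth reporting separately.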