arXivDaily: arXiv daily digest, updated Monday to Friday
2510.01105 2026-05-11 cs.LG

Geometric Analysis of Neural Regression Collapse via Intrinsic Dimension

George Andriopoulos, Zixuan Dong, Bimarsha Adhikari, Keith Ross

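Affiliations: New York University Abu Dhabi; SFSC of AI and DL, NYU Shanghai; New York University

AI summary: Using intrinsic-dimension analysis, the study reveals the geometric mechanism behind neural regression collapse, finds that over-compression degrades generalization, and proposes expanding or reducing feature dimensionality as strategies to improve performance.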

Comments 36 pages, 21 figures

Abstract

Neural multivariate regression underpins a wide range of domains, including control, robotics, and finance, yet the geometry of its learned representations remains poorly characterized. While neural collapse has been shown to benefit generalization in classification, we find that analogous collapse in regression consistently degrades performance. To explain this contrast, we analyze regression models through the lens of intrinsic dimension. Across control tasks and synthetic datasets, we estimate the intrinsic dimension of last-layer features (ID_H) and compare it with that of the regression targets (ID_Y). Collapsed models exhibit ID_H < ID_Y, leading to over-compression and poor generalization, whereas non-collapsed models typically maintain ID_H > ID_Y. For the non-collapsed models, performance with respect to ID_H depends on the data quantity and noise levels. From these observations, we identify two regimes (over-compressed and under-compressed) that determine when expanding or reducing feature dimensionality improves performance. Our results provide new geometric insights into neural regression collapse and suggest practical strategies for enhancing generalization.
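
A minimal sketch of the kind of comparison described above, assuming the TwoNN intrinsic-dimension estimator (Facco et al., 2017) in its maximum-likelihood form; the abstract does not say which estimator the authors use. The helix and plane are hypothetical stand-ins for the targets Y and the last-layer features H, and the function name is illustrative.

```python
import math
import random


def two_nn_intrinsic_dimension(points):
    """TwoNN estimate: d ~= N / sum_i log(r2_i / r1_i), where r1 and r2 are the
    distances from point i to its first and second nearest neighbours."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


random.seed(0)
# Hypothetical stand-ins: targets Y lie on a 1-D helix in R^3,
# last-layer features H fill a 2-D plane in R^3.
ts = [random.uniform(0.0, 3.0) for _ in range(400)]
Y = [(t, math.sin(t), math.cos(t)) for t in ts]
H = [(random.random(), random.random(), 0.0) for _ in range(400)]

id_y = two_nn_intrinsic_dimension(Y)
id_h = two_nn_intrinsic_dimension(H)
print(f"ID_Y ~ {id_y:.2f}, ID_H ~ {id_h:.2f}")
print("over-compressed (ID_H < ID_Y)" if id_h < id_y else "ID_H >= ID_Y")
```

With these toy inputs ID_Y should come out near 1 and ID_H near 2, so the pair does not fall in the over-compressed case. TwoNN needs only first and second nearest-neighbour distances, but the brute-force O(N^2) search used here is suitable only for small samples.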

2510.00568 2026-05-11 cs.CL

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen

Affiliations: Basic Algorithm Center, PCG, Tencent, Shenzhen, China; Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China

AI summary: This paper proposes ReSeek, a framework that introduces a self-correction mechanism and a dense, instructive reward function to improve the performance and path faithfulness of search agents on complex tasks.

Comments ICML 2026

Abstract

Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. The approach is intuitively reasonable and practically simple, and extensive experiments show that agents trained with ReSeek significantly outperform SOTA baselines in task success rate and path faithfulness.
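
The dense process reward can be sketched as a per-step sum of a correctness term and a utility term. Everything concrete here is an assumption for illustration: the `Step` record, the boolean verdicts (which in practice would come from some judge), the equal weights, and the choice to reward only SEARCH steps.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str        # "SEARCH", "JUDGE", or "ANSWER" (only JUDGE is named in the abstract)
    is_correct: bool   # hypothetical verdict: retrieved content is factually correct
    is_useful: bool    # hypothetical verdict: content genuinely helps answer the query


def process_rewards(steps, w_correct=0.5, w_utility=0.5):
    """Dense reward per step: correctness term + utility term for retrieval steps."""
    rewards = []
    for step in steps:
        if step.action == "SEARCH":
            rewards.append(w_correct * step.is_correct + w_utility * step.is_useful)
        else:
            rewards.append(0.0)  # assumption: non-retrieval steps get no process reward
    return rewards


episode = [
    Step("SEARCH", is_correct=True, is_useful=False),  # accurate but off-target evidence
    Step("JUDGE", is_correct=False, is_useful=False),  # agent judges the evidence and re-plans
    Step("SEARCH", is_correct=True, is_useful=True),   # accurate and useful evidence
    Step("ANSWER", is_correct=False, is_useful=False),
]
print(process_rewards(episode))  # [0.5, 0.0, 1.0, 0.0]
```

Splitting the reward this way lets a retrieval that is factually right but irrelevant earn only partial credit, which is the distinction the two reward terms are meant to draw.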

2510.00253 2026-05-11 cs.LG

DReS: Dual Reconstruction Smoothing for Functional Regularization

Parsa Moradi, Tayyebeh Jahaninezhad, Hanzaleh Akbarinodehi, Mohammad Ali Maddah-Ali

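Affiliations: University of Minnesota; Technical University Berlin

AI summary: DReS induces smoothness through a spline-based auxiliary branch without adding parameters; it applies to unsupervised, self-supervised, and supervised learning and improves representation learning and generation quality.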

Abstract

Smoothness is a key inductive bias in machine learning and is closely related to generalization. Existing smoothness-inducing methods typically rely either on explicit gradient regularization, which often incurs substantial computational and memory overhead, or on data-mixing strategies, which are less naturally applicable to unsupervised and self-supervised settings. In this work, we propose Dual Reconstruction Smoothing (DReS), a nonparametric regularization framework that induces smoothness through a spline-based auxiliary branch with shared model parameters. The method introduces no additional trainable parameters and can be applied to arbitrary submodules, making it suitable for unsupervised, self-supervised, and supervised regimes. We show theoretically that the discrepancy between the target function and its DReS approximation is controlled by higher-order smoothness quantities of the function, establishing the method as an implicit higher-order smoothness regularizer. Empirically, DReS improves representation learning across several self-supervised methods, improves generation quality in generative modeling, and achieves strong performance relative to competitive baselines in supervised learning.
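
The DReS branch itself is not specified in detail here, so the following is only a one-dimensional illustration of the theoretical claim that the gap between a function and a spline reconstructed from nearby samples is governed by higher-order derivatives. For linear interpolation over a gap of width h, the midpoint error is at most h^2/8 * max|f''|, so penalizing it leaves affine functions untouched and penalizes curvature. The function names and the use of linear (rather than higher-order) splines are assumptions.

```python
def interpolation_gap(f, x, h):
    """f(x) minus the linear spline through (x - h/2, f(x - h/2)) and (x + h/2, f(x + h/2))."""
    return f(x) - 0.5 * (f(x - h / 2) + f(x + h / 2))


def smoothness_penalty(f, xs, h=0.1):
    """Mean squared interpolation gap: zero for affine f, about (h^2 / 8 * f'')^2 otherwise."""
    return sum(interpolation_gap(f, x, h) ** 2 for x in xs) / len(xs)


xs = [0.0, 0.5, 1.0, 1.5]
print(smoothness_penalty(lambda x: 3 * x + 1, xs))  # ~0: affine functions are reproduced exactly
print(smoothness_penalty(lambda x: x * x, xs))      # close to (h^2 / 4)^2 = 6.25e-06 for h = 0.1
```

For f(x) = x^2 the gap is exactly -h^2/4 at every point, matching the h^2/8 * |f''| bound with f'' = 2.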

2509.26524 2026-05-11 cs.LG cs.AI

TAP: Two-Stage Adaptive Personalization of Multi-Task and Multi-Modal Foundation Models in Federated Learning

Seohyun Lee, Wenzhi Fang, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton

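Affiliations: Purdue University; Yonsei University; University at Buffalo-SUNY

AI summary: This paper proposes TAP, a two-stage adaptive personalization method for federated learning that improves the generalization of multi-task and multi-modal foundation models while addressing heterogeneity in data, tasks, and modalities.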

Comments 29 pages

Abstract

In federated learning (FL), local personalization of models has received significant attention, yet personalized fine-tuning of foundation models remains underexplored. In particular, the literature lacks an understanding of how to personalize foundation models in settings where heterogeneity exists not only in data, but also in tasks and modalities across the clients. To address this gap, we propose Two-Stage Adaptive Personalization (TAP). In the first stage, TAP leverages mismatched model architectures between clients and the server to selectively replace personalized parameters with global updates, explicitly limiting cross-task and cross-modality interference. In the second stage, TAP conducts post-FL distillation on the global model to recover a beneficial shared structure. By reintroducing generalizable knowledge only after the global model has stabilized, TAP enhances generalization without compromising personalization. In developing our methodology, we introduce the first convergence analysis of federated foundation model training at the server under modality-task pair heterogeneity across clients, and demonstrate the impact of the number of modality-task pairs on model fine-tuning. Through extensive experiments, we demonstrate the effectiveness of TAP across a variety of datasets and tasks in comparison to state-of-the-art baselines. The implementation code is publicly available at https://github.com/lee3296/TAP.
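
A minimal sketch of the first-stage idea, assuming parameters are stored as flat lists keyed by name: a global parameter overwrites the client's copy only when the client has a matching parameter and has not marked it as personalized, so task- or modality-specific parts (such as a text head on a client whose server counterpart has an image head) stay local. The function name, the length-based shape check, and the `personalized` set are hypothetical; TAP's actual replacement criterion may differ.

```python
def replace_with_global(client_params, server_params, personalized=frozenset()):
    """Copy a global parameter into the client model only when the client has a
    parameter of the same name and size and has not marked it as personalized."""
    merged = {}
    for name, local in client_params.items():
        glob = server_params.get(name)
        shape_matches = glob is not None and len(glob) == len(local)
        if shape_matches and name not in personalized:
            merged[name] = list(glob)   # take the global update
        else:
            merged[name] = list(local)  # keep the client's own (personalized) parameters
    return merged


client = {"encoder.w": [0.1, 0.2], "head.text": [1.0], "adapter": [0.5, 0.5]}
server = {"encoder.w": [0.3, 0.4], "head.image": [2.0], "adapter": [0.9, 0.9]}
print(replace_with_global(client, server, personalized={"adapter"}))
# {'encoder.w': [0.3, 0.4], 'head.text': [1.0], 'adapter': [0.5, 0.5]}
```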

2509.26272 2026-05-11 cs.CV cs.LG

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Tuan Nguyen, Naseem Khan, Khang Tran, NhatHai Phan, Issa Khalil

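Affiliations: Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar; New Jersey Institute of Technology, NJ, USA

AI summary: This paper proposes PRPO, which improves the accuracy and reasoning of deepfake detection through paragraph-level relative policy optimization; experiments show significant gains over existing methods in both detection accuracy and reasoning score.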

Comments Accepted at ICML 2026

Abstract

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
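
PRPO's name suggests a GRPO-style, group-relative advantage computed at paragraph granularity. The sketch below shows only the standard group-relative normalization, applied to a hypothetical group of paragraph-level alignment scores; how PRPO actually forms groups and scores paragraphs is an assumption not covered here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for how well one paragraph, sampled four times, matches the visual evidence.
paragraph_scores = [0.9, 0.4, 0.7, 0.2]
print([round(a, 3) for a in group_relative_advantages(paragraph_scores)])
```

Because advantages are relative to the group, a paragraph is reinforced only when it is better grounded than its siblings, not merely when its absolute score is high.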

2509.23370 2026-05-11 cs.CV

GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval

Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang, Tianwen Jiang, Mingyang Chen, Yutang, Qiuyong Xiao, Jihong Zhang, Zhixun Su

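Affiliations: Dalian University of Technology, Dalian, China; Tencent HunYuan Data, Shenzhen, China

AI summary: GRAPE uses group relative policy optimization with ranking signals to improve retrieval for multilingual, long-form, and multimodal queries, raising Recall@10 by 4.9% on average.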

Abstract

The CLIP model has established itself as a cornerstone of large-scale retrieval systems. However, its performance often degrades under distributional shifts such as multilingual, long-form, or multimodal queries. To avoid the prohibitive costs associated with retriever retraining or corpus re-embedding, we propose GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play approach that leverages LLM-based query rewriting to bridge these gaps. Unlike existing methods that lack explicit supervision, GRAPE integrates ranking signals into the rewriting LLM via Grouped Relative Policy Optimization (GRPO), ensuring rewritten queries are better aligned with the frozen retriever's latent distribution. Crucially, we identify a score inflation phenomenon in naive similarity-based finetuning, in which irrelevant candidates receive indiscriminately high scores, and mitigate it with a novel corpus-relative ranking-based reward. Extensive experiments across multilingual (Flickr30k-CN, CVLUE, XM3600), long-form (Wikipedia), and multimodal (CIRR) benchmarks demonstrate that GRAPE consistently improves performance, achieving an average gain of 4.9% in Recall@10 without any modification to the underlying retriever. The code is available at https://github.com/mogulzhang/GRAPE.
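
The score-inflation problem can be seen with a toy example: if a rewrite raises the similarity of every corpus item, a reward equal to the target's raw similarity goes up even though retrieval did not improve, while a reward based on the target's rank within the corpus stays the same. Reciprocal rank is used below as one possible ranking-based reward; it is an assumption, not necessarily GRAPE's exact formula, and the scores are made up.

```python
def target_rank(scores, target):
    """1-based rank of the ground-truth item among all corpus candidates (higher score is better)."""
    return 1 + sum(1 for i, s in enumerate(scores) if i != target and s > scores[target])


def ranking_reward(scores, target):
    """Corpus-relative reward: reciprocal rank, which ignores uniform score inflation."""
    return 1.0 / target_rank(scores, target)


def recall_at_k(batch_scores, targets, k=10):
    """Fraction of queries whose target appears in the top k."""
    hits = sum(target_rank(s, t) <= k for s, t in zip(batch_scores, targets))
    return hits / len(targets)


# Hypothetical similarities of one rewritten query against a 3-item corpus (target = item 0).
plain = [0.30, 0.50, 0.20]
inflated = [0.80, 0.95, 0.75]  # every candidate scores higher; the ranking is unchanged
print(plain[0], inflated[0])                                  # naive similarity reward rises
print(ranking_reward(plain, 0), ranking_reward(inflated, 0))  # 0.5 0.5
```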

2509.08089 2026-05-11 cs.LG cs.CR

Hammer and Anvil: Toward a Theory of Backdoors in Federated Learning

Lucas Fenaux, Zheng Wang, Jacob Yan, Nathan Chung, Florian Kerschbaum

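Affiliations: University of Waterloo

AI summary: This paper proposes the "Hammer and Anvil" theoretical framework, which classifies defenses into two types based on the deviation δ of backdoor updates; it shows that single-type or non-principled combined defenses leave exploitable gaps, whereas a principled combination of both types withstands adaptive attacks.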

Abstract

Federated Learning (FL) enables distributed model training but is vulnerable to backdoor attacks, where malicious clients embed attacker-controlled behaviors into the global model. Existing defenses fail against adaptive adversaries. In this paper, we present "Hammer and Anvil", a principled theoretical framework that categorizes backdoors by the deviation, $\delta$, of their updates from the mean of the updates. We identify two fundamental defense types: "Type 1 (The Anvil)", comprising outlier detection and robust aggregation effective against large-deviation attacks, and "Type 2 (The Hammer)", consisting of removal-based defenses effective against small-deviation attacks. We demonstrate that defenses of a single type and non-principled combined defenses inherently leave an exploitable gap for adaptive attackers. To bridge this gap, we propose the principled combination of Type 1 and Type 2 defenses. We evaluate our framework against a new, worst-case, full-information adaptive adversary that knows the benign updates, the aggregation algorithm, and its parameters, and yet this adversary fails against our combined defenses. Our empirical evaluation across various datasets and settings shows that single-typed and non-principled combined defenses are easily broken, often by a single malicious client. In contrast, our best combined defense variants, $HA_{Flame}^{CSFT}$, $HA_{Krum}^{CSFT}$, and $HA_{Multi-Metrics}^{CSFT}$, remain undefeated even in the most adversarial settings. Our results provide a principled approach for research on backdoors in federated learning.
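
The quantity δ can be sketched as each update's Euclidean distance from the mean update, together with a toy Type 1 (Anvil) filter that discards large-δ updates. The median-based cutoff is an illustrative assumption, not one of the paper's defenses; the point is that an attacker who keeps δ small passes such a filter, which is why the framework pairs it with Type 2 (Hammer) removal-based defenses.

```python
import math
import statistics


def deviations(updates):
    """delta_i = Euclidean distance of client update i from the mean update."""
    n, dim = len(updates), len(updates[0])
    mean = [sum(u[k] for u in updates) / n for k in range(dim)]
    return [math.dist(u, mean) for u in updates]


def type1_filter(updates, factor=2.0):
    """Toy outlier filter: drop updates whose delta exceeds factor * median(delta)."""
    deltas = deviations(updates)
    cutoff = factor * statistics.median(deltas)
    return [u for u, d in zip(updates, deltas) if d <= cutoff]


benign = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0]]
loud_attacker = [10.0, 0.0]   # large delta: caught by the Anvil-style filter
quiet_attacker = [1.2, 0.0]   # small delta: slips through, needs a Hammer-style defense
kept = type1_filter(benign + [loud_attacker, quiet_attacker])
print(kept)
```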

2509.03736 2026-05-11 cs.AI cs.CL cs.LG

Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

James Mooney, Josef Woldense, Zheng Robert Jia, Shirley Anugrah Hayati, My Ha Nguyen, Vipul Raheja, Dongyeop Kang

Affiliations: Department of Computer Science and Engineering, University of Minnesota; Department of African American & African Studies, University of Minnesota; Department of Sociology, University of Chicago

AI summary: This paper examines whether LLM agents remain empirically consistent across experimental settings, testing their behavioral consistency by revealing their latent profiles, and finds significant inconsistencies across LLM model families and sizes.

Comments 25 pages, 9 figures, 7 tables

Abstract

The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human-subject research. To evaluate this claim, prior research has largely focused on whether LLM-generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, we address a more fundamental question: Do agents maintain empirical consistency, aligning with human behavioral models, when examined under different experimental settings? To this end, we develop a study designed to (a) ask a set of questions that reveal an agent's latent profile and (b) examine agent behavioral consistency in a conversational setting with other agents. This design enables us to explore a set of behavioral hypotheses to assess whether an agent's conversational behavior is consistent with what we would expect from its revealed state. Our findings show significant inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be empirically consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research.

2509.00338 2026-05-11 cs.LG cs.AI

Scalable Option Learning in High-Throughput Environments

Mikael Henaff, Scott Fujimoto, Michael Matthews, Michael Rabbat

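Affiliations: Meta Superintelligence Labs; University of Oxford

AI summary: This paper proposes SOL, a hierarchical reinforcement learning algorithm with roughly 35× higher throughput than existing hierarchical methods, and validates its scalability and generality in high-throughput environments, including complex games such as NetHack.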

Abstract

Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm that achieves roughly 35× higher throughput than existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and MuJoCo environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.

2508.15989 2026-05-11 cs.LG cs.ET

Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Jiaqi Lin, Malyaban Bal, Abhronil Sengupta

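Affiliations: School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802

AI summary: This paper proposes a new equilibrium propagation framework that uses layer-wise learning signals as auxiliary supervision to counter vanishing gradients in deep networks and improve convergence; it achieves state-of-the-art performance on CIFAR-10 and CIFAR-100, demonstrating that equilibrium propagation scales to deep VGG architectures.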

Abstract

Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To alleviate the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates layer-wise learning signals to provide auxiliary supervision, which enhances the convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, suggesting that intermediate learning signals can extend the practical applicability of EP to deeper architectures.
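
For reference, the classic two-phase EP weight update (Scellier and Bengio, 2017) contrasts neuron activations at the free equilibrium (β = 0) and at the weakly nudged equilibrium (β > 0). The sketch assumes both equilibria have already been computed, uses a hard sigmoid for ρ, ignores biases, and uses the convention that W is increased by this quantity times a learning rate. It illustrates the base rule only, not the paper's layer-wise auxiliary signals or knowledge distillation.

```python
def hard_sigmoid(s):
    return min(max(s, 0.0), 1.0)


def ep_weight_update(free_states, nudged_states, beta, rho=hard_sigmoid):
    """Contrastive EP update for symmetric weights:
    dW_ij = (rho(s_i^beta) * rho(s_j^beta) - rho(s_i^0) * rho(s_j^0)) / beta."""
    n = len(free_states)
    r0 = [rho(s) for s in free_states]    # activations at the free equilibrium
    rb = [rho(s) for s in nudged_states]  # activations at the nudged equilibrium
    return [[(rb[i] * rb[j] - r0[i] * r0[j]) / beta for j in range(n)] for i in range(n)]


# Hypothetical two-neuron equilibria: the output neuron is nudged upward toward its target.
print(ep_weight_update([0.5, 0.2], [0.6, 0.2], beta=0.5))
```

Each weight update depends only on the two neurons it connects, which is what makes EP a local learning rule.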

2508.05803 2026-05-11 cs.CL

Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

Abishek Thamma, Micha Heilbron

Affiliations: University of Amsterdam, Amsterdam Brain and Cognition; Vrije Universiteit Amsterdam, Department of Informatics; Max Planck Institute for Psycholinguistics

AI summary: The study examines how fleeting memory affects language learning and reading time prediction, finding that it improves language learning but impairs reading time prediction, which challenges a traditional view in cognitive science.

Comments Revised after peer review. Accepted for publication in Transactions of the Association for Computational Linguistics

Abstract

Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language, an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow-up analyses revealed that this discrepancy (better language modeling, yet worse reading time prediction) could not be accounted for by prior explanations of why better language models sometimes fit human reading times worse. Together, these results support a benefit of memory limitations on neural network language learning, but not on predicting behavior.

2508.05773 2026-05-11 cs.RO cs.SY eess.SY

GPU-Accelerated Barrier-Rate Guided MPPI Control for Tractor-Trailer Systems

Keyvan Majd, Hardik Parwana, Bardh Hoxha, Steven Hong, Hideki Okamoto, Georgios Fainekos

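Affiliations: Toyota Motor North America, Research & Development

AI summary: This paper proposes a barrier-rate guided MPPI control method for navigating tractor-trailer systems in complex environments; by embedding control barrier function constraints into the path-integral update, it improves trajectory robustness and exploration.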

Comments Accepted to IEEE ITSC 2025

Abstract

Articulated vehicles such as tractor-trailers, yard trucks, and similar platforms must often reverse and maneuver in cluttered spaces where pedestrians are present. We present how Barrier-Rate guided Model Predictive Path Integral (BR-MPPI) control can solve navigation in such challenging environments. BR-MPPI embeds Control Barrier Function (CBF) constraints directly into the path-integral update. By steering the importance-sampling distribution toward collision-free, dynamically feasible trajectories, BR-MPPI enhances the exploration strength of MPPI and improves robustness of resulting trajectories. The method is evaluated in the high-fidelity CarMaker simulator on a 12 m tractor-trailer tasked with reverse and forward parking in a parking lot. BR-MPPI computes control inputs at above 100 Hz on a single GPU (for scenarios with eight obstacles) and maintains better parking clearance than a standard MPPI baseline and an MPPI with collision cost baseline.
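
A simplified stand-in, not BR-MPPI itself: a 1-D single integrator approaches a wall placed before its goal, MPPI samples are penalized when they violate the discrete-time CBF condition h(x_next) >= (1 - gamma) * h(x) with h(x) = WALL - x, and a final clipping step enforces that condition on the executed control. BR-MPPI instead embeds the barrier-rate constraint into the importance-sampling distribution itself; the dynamics, costs, and constants below are hypothetical.

```python
import math
import random

GOAL, WALL, GAMMA, DT = 2.0, 1.5, 0.2, 0.1  # hypothetical 1-D task: the wall sits before the goal


def cbf_satisfied(x, x_next):
    """Discrete-time CBF condition for h(x) = WALL - x: h(x_next) >= (1 - GAMMA) * h(x)."""
    return (WALL - x_next) >= (1.0 - GAMMA) * (WALL - x)


def mppi_update(x, nominal, rng, n_samples=100, sigma=1.0, lam=1.0, penalty=1e4):
    """One MPPI iteration for the single integrator x' = x + u * DT."""
    noises, costs = [], []
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, sigma) for _ in nominal]
        xs, cost = x, 0.0
        for u_nom, eps in zip(nominal, noise):
            x_next = xs + (u_nom + eps) * DT
            cost += (x_next - GOAL) ** 2
            if not cbf_satisfied(xs, x_next):
                cost += penalty  # barrier-rate violation: the sample gets near-zero weight
            xs = x_next
        noises.append(noise)
        costs.append(cost)
    best = min(costs)
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # Path-integral update: add the cost-weighted average perturbation to the plan.
    return [u + sum(w * nz[t] for w, nz in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]


def safety_filter(x, u):
    """Clip u so the executed step satisfies the CBF condition exactly."""
    return min(u, GAMMA * (WALL - x) / DT)


rng = random.Random(0)
x, plan = 0.0, [0.0] * 20
trajectory = [x]
for _ in range(60):
    plan = mppi_update(x, plan, rng)
    x += safety_filter(x, plan[0]) * DT
    trajectory.append(x)
    plan = plan[1:] + [0.0]  # receding horizon: shift the plan and pad the tail
print(f"final x = {x:.3f}, max x = {max(trajectory):.3f} (wall at {WALL})")
```

For a single integrator the CBF condition is a linear upper bound on u, so clipping is the minimal change that keeps every executed step on the safe side of the wall, whatever the sampled plan proposes.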

2508.04056 2026-05-11 cs.RO q-bio.QM

SCOUT: Closed-Loop in-vivo System for Continuous Methane Concentration Monitoring in Cattle

Yuelin Deng, Hinayah Rojas de Oliveira, Richard M. Voyles, Upinder Kaur

Affiliations: Purdue Polytechnic, Purdue University; Department of Animal Sciences, Purdue University; University of Texas at Arlington Research Institute, The University of Texas at Arlington; Department of Agricultural and Biological Engineering, Purdue University

AI summary: SCOUT is the first autonomous closed-loop system for continuously monitoring ruminal headspace methane concentrations in ruminants; closed-loop gas recirculation maintains anaerobic ruminal conditions, and the system provides higher signal resolution for characterizing methane dynamics.

Abstract

Enteric methane measurement from ruminant livestock faces fundamental trade-offs between accuracy and operational feasibility. Existing methods quantify methane after eructation and atmospheric dilution, limiting temporal resolution and confounding biological signals with environmental variables. We present SCOUT (Smart Cannula-mounted Optical Unit for Trace-methane), the first autonomous system for continuous in-vivo monitoring of ruminal headspace methane concentrations. The system addresses a critical engineering barrier through closed-loop gas recirculation that maintains anaerobic ruminal conditions during persistent headspace sampling. SCOUT was deployed on cannulated Simmental heifers under contrasting dietary treatments. Headspace concentrations were 100 to 1000 times higher than concurrent ambient sniffer readings, providing substantially greater signal resolution for characterizing methane dynamics. High-frequency monitoring revealed behavior-production coupling previously inaccessible, including rapid concentration changes (14.5k ± 11.3k ppm) associated with postural transitions within 15-minute intervals. Cross-platform comparison with ambient sniffers showed scale-dependent correspondence between production and release measurements, with an optimal correlation (r = -0.564) at 40-minute averaging windows consistent with eructation cycles. These results demonstrate that the rumen headspace contains continuous, biologically interpretable methane signals that SCOUT can reliably access, establishing the measurement infrastructure necessary for developing concentration-to-flux models that would support precision phenotyping, emission proxy calibration, and mitigation strategy evaluation.
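
The scale-dependent comparison can be sketched as block-averaging both series over candidate windows and keeping the window with the strongest correlation. This assumes time-aligned, equally sampled series (for example one sample per minute, so a 40-minute window would be `window=40`) and ranks windows by |r| because the reported optimum is negative; the function names and selection rule are illustrative assumptions.

```python
import math


def block_average(series, window):
    """Average non-overlapping blocks of `window` samples (a trailing partial block is dropped)."""
    n_blocks = len(series) // window
    return [sum(series[i * window:(i + 1) * window]) / window for i in range(n_blocks)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def best_window(headspace, ambient, windows):
    """Return (window, r) with the largest |r| between the block-averaged series."""
    results = [(w, pearson(block_average(headspace, w), block_average(ambient, w)))
               for w in windows]
    return max(results, key=lambda wr: abs(wr[1]))


# Toy series only; real inputs would be aligned headspace and sniffer readings.
headspace = [float(i) for i in range(1, 9)]
ambient = [-v for v in headspace]
print(block_average(headspace, 2))  # [1.5, 3.5, 5.5, 7.5]
print(best_window(headspace, ambient, windows=[1, 2]))
```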

2508.02129 2026-05-11 cs.CV

VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling

Yuru Xiao, Zihan Lin, Chao Lu, Deming Zhai, Kui Jiang, Wenbo Zhao, Wei Zhang, Junjun Jiang, Huanran Wang, Xianming Liu

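Affiliations: Harbin Institute of Technology

AI summary: This paper proposes a video-diffusion-enhanced 4D Gaussian Splatting method that addresses the difficulty of modeling fast-moving objects in dynamic scenes, improving novel view synthesis through temporally consistent priors and uncertainty distillation.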

Abstract

Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
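
Because PSNR is logarithmic in the mean squared error, the reported gain of about 2 dB can be restated in MSE terms. Assuming both methods are scored on the same images with the same peak value MAX:

```latex
\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}}, \qquad
\Delta\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MSE}_{\text{base}}}{\mathrm{MSE}_{\text{ours}}} = 2\ \text{dB}
\;\Longrightarrow\;
\frac{\mathrm{MSE}_{\text{ours}}}{\mathrm{MSE}_{\text{base}}} = 10^{-0.2} \approx 0.631
```

That is, roughly 37% lower mean squared error than the baseline.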

2508.01248 2026-05-11 cs.CV

NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, Zhangjie Fu

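Affiliations: Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology; University of Macau

AI summary: This paper proposes NS-Net, which decouples semantic information from CLIP visual features via NULL-space projection and uses contrastive learning to capture distributional differences between real and generated images, improving the generalization of AI-generated image detection.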

Abstract

The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP's visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
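
NULL-space projection can be sketched as removing from a feature vector every component that lies in the span of a set of semantic directions, leaving only the part in the null space of the matrix whose rows are those directions. Which vectors span the semantic subspace (for example, CLIP text embeddings) is an assumption here, as are the function names; plain Gram-Schmidt keeps the example dependency-free.

```python
def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors` (dependent vectors are skipped)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(w, q))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = sum(a * a for a in w) ** 0.5
        if norm > tol:
            basis.append([a / norm for a in w])
    return basis


def project_to_null_space(feature, semantic_directions):
    """Remove every component of `feature` lying in span(semantic_directions).
    The result lies in the null space of the matrix whose rows are those directions."""
    out = list(feature)
    for q in orthonormal_basis(semantic_directions):
        coeff = sum(a * b for a, b in zip(out, q))
        out = [a - coeff * b for a, b in zip(out, q)]
    return out


# Hypothetical 3-D feature and two "semantic" directions spanning the xy-plane.
print(project_to_null_space([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]))  # [0.0, 0.0, 3.0]
```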

2506.23875 2026-05-11 cs.LG cs.AI

Discovering Learning-Friendly Generation Orders for Sequential Computation

Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera

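Affiliations: Chiba University; National Institute of Informatics

AI summary: This paper proposes discovering learning-friendly generation orders automatically through loss profiling, optimizing orders at the block and within-block levels via a hierarchical global-local search, which raises training success rates.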

Comments 10+24 pages, 10 figures

Abstract

Sequential computation via autoregressive generation can make difficult tasks learnable, but the generation order of intermediate states strongly affects whether training succeeds. We address the problem of discovering a learning-friendly target order automatically, rather than relying on task-specific design. Our key observation is that learning-friendly orders cause a faster loss drop in the early stage of training. We exploit this by loss profiling, which ranks candidate orders by the early-stage loss of a single short run. To handle the factorial candidate space, we wrap loss profiling in a hierarchical global-local search over block- and within-block-level orderings. On six order-sensitive tasks, the method discovers effective orders up to $L=13$ from random initialization and up to $L=40$ from structured initialization, lifting success rates from about 10\% to near 100\%. On integer multiplication, it rediscovers the reverse-digit order that prior studies reported to be efficient. On delay dynamical systems, as a case study of multivariate recurrences, learnability varies sharply even among valid topological sorts of the dependency graph: loss profiling identifies a learning-friendly one, and the global search even discovers orders surpassing hand-designed candidates.
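Below is a minimal sketch of loss profiling as the abstract describes it. Each candidate order receives one short training run, and candidates are ranked by their early-stage loss. Several parts are assumptions:

- `short_run_losses` is a hypothetical callable standing in for real training.
- Averaging the first `early_steps` losses is our own scoring choice.
- The adjacent-swap local search is a simplified stand-in for the within-block part of the paper's hierarchical global-local search.
- The toy loss at the end only rewards agreement with a reversed order, so that the example runs without a model.

```python
from typing import Callable, Sequence

Order = tuple[int, ...]
LossFn = Callable[[Order], Sequence[float]]

def profile_score(order: Order, short_run_losses: LossFn, early_steps: int = 10) -> float:
    """Mean loss over the first early_steps of one short run (lower = more learning-friendly)."""
    head = list(short_run_losses(order))[:early_steps]
    return sum(head) / len(head)

def rank_orders(candidates: list[Order], short_run_losses: LossFn, early_steps: int = 10) -> list[Order]:
    """Sort candidate orders by early-stage loss, most learning-friendly first."""
    return sorted(candidates, key=lambda o: profile_score(o, short_run_losses, early_steps))

def local_search(order: Order, short_run_losses: LossFn, early_steps: int = 10,
                 max_rounds: int = 100) -> Order:
    """First-improvement search over adjacent swaps, using the profile score as objective."""
    best = tuple(order)
    best_score = profile_score(best, short_run_losses, early_steps)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best) - 1):
            cand = list(best)
            cand[i], cand[i + 1] = cand[i + 1], cand[i]
            cand_score = profile_score(tuple(cand), short_run_losses, early_steps)
            if cand_score < best_score:        # accept any improving swap immediately
                best, best_score, improved = tuple(cand), cand_score, True
        if not improved:
            break
    return best

def toy_short_run(order: Order) -> list[float]:
    """Hypothetical stand-in for training: the loss curve is higher the more pairwise
    inversions the order has relative to the reversed order."""
    target = tuple(reversed(range(len(order))))
    pos = {v: i for i, v in enumerate(target)}
    ranks = [pos[v] for v in order]
    inversions = sum(ranks[a] > ranks[b]
                     for a in range(len(ranks)) for b in range(a + 1, len(ranks)))
    return [(1.0 + inversions) * 0.9 ** t for t in range(20)]

cands = [(0, 1, 2, 3), (3, 2, 1, 0), (1, 0, 3, 2)]
print(rank_orders(cands, toy_short_run)[0])        # (3, 2, 1, 0)
print(local_search((0, 1, 2, 3), toy_short_run))   # (3, 2, 1, 0)
```

Ranking only needs one short run per candidate. That cheapness is what makes wrapping the score inside a search loop affordable.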

2506.14951 2026-05-11 cs.LG cs.AI cs.NE

Flat Channels to Infinity in Neural Loss Landscapes

Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea

Affiliations: EPFL; Flatiron Institute

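AI summary: The study identifies flat channels to infinity in neural network loss landscapes. Along these channels the loss decreases slowly while the output weights diverge. Gradient-dynamics and geometric analysis are used to characterize their properties and computational capabilities.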

Comments Accepted to NeurIPS'25 (fixed resolution of equations in figs.1,2,3)

Abstract

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w}_i$ and $\mathbf{w}_j$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i\sigma(\mathbf{w}_i \cdot \mathbf{x}) + a_j\sigma(\mathbf{w}_j \cdot \mathbf{x}) \rightarrow \sigma(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x})\,\sigma'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
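The gated-linear-unit limit can be checked numerically with one explicit parametrization. This parametrization is our own choice and not necessarily the path the paper's optimizers follow:

- $\mathbf{w}_i = \mathbf{w} + \varepsilon\mathbf{v}$ and $\mathbf{w}_j = \mathbf{w}$,
- $a_i = 1/\varepsilon$ and $a_j = 1 - 1/\varepsilon$.

A first-order Taylor expansion then gives $a_i\sigma(\mathbf{w}_i\cdot\mathbf{x}) + a_j\sigma(\mathbf{w}_j\cdot\mathbf{x}) = \sigma(\mathbf{w}\cdot\mathbf{x}) + (\mathbf{v}\cdot\mathbf{x})\,\sigma'(\mathbf{w}\cdot\mathbf{x}) + O(\varepsilon)$. As $\varepsilon \to 0$, the output weights diverge to $\pm\infty$ while the input weights merge. The script assumes $\sigma = \tanh$.

```python
import numpy as np

def two_neuron_output(x, w, v, eps):
    """Two tanh neurons placed on the channel at distance eps (our parametrization)."""
    a_i, a_j = 1.0 / eps, 1.0 - 1.0 / eps   # output weights diverge to +inf and -inf
    w_i, w_j = w + eps * v, w               # input weight vectors merge as eps -> 0
    return a_i * np.tanh(x @ w_i) + a_j * np.tanh(x @ w_j)

def gated_linear_unit(x, w, v):
    """Limit sigma(w.x) + (v.x) * sigma'(w.x) for sigma = tanh, sigma' = 1 - tanh^2."""
    z = x @ w
    return np.tanh(z) + (x @ v) * (1.0 - np.tanh(z) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
w, v = rng.normal(size=5), rng.normal(size=5)
target = gated_linear_unit(x, w, v)

errors = []
for eps in (1e-1, 1e-2, 1e-3):
    err = float(np.max(np.abs(two_neuron_output(x, w, v, eps) - target)))
    errors.append(err)
    print(f"eps={eps:g}  max error={err:.2e}")   # shrinks roughly linearly with eps
```

The residual error comes from the second-order Taylor term, so it scales with $\varepsilon$. This is consistent with the channel being approached only asymptotically.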

2506.12362 2026-05-11 cs.LG cs.AI

HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs

Xingyue Huang, Mikhail Galkin, Michael M. Bronstein, İsmail İlkan Ceylan

Affiliations: University of Oxford; Google Research; AITHYRA; TU Wien

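AI summary: HYPER is a foundation model for inductive link prediction on knowledge hypergraphs containing novel entities and novel relations. It encodes the entities in each hyperedge together with their positions, which lets it transfer across relation types.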

Abstract

Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely novel entities (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with novel relation types (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to any knowledge hypergraph, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of varying arities, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.
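To make the setting concrete, this sketch stores a knowledge-hypergraph fact of arbitrary arity as a relation plus an ordered entity tuple. It then builds an inductive query by masking one position. Pairing each entity with its position index mirrors the abstract's "entities of each hyperedge along with their respective positions". The data structures, the `'?'` placeholder, and the example fact are illustrative assumptions, not HYPER's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperEdge:
    relation: str
    entities: tuple[str, ...]   # position k carries the k-th argument role

    @property
    def arity(self) -> int:
        return len(self.entities)

    def positional_pairs(self) -> list[tuple[str, int]]:
        """(entity, position) pairs: a view of the fact that does not depend on the relation name."""
        return [(e, k) for k, e in enumerate(self.entities)]

def make_query(edge: HyperEdge, masked_pos: int) -> tuple[HyperEdge, str]:
    """Hide the entity at masked_pos; a link predictor must rank candidates for the gap."""
    ents = list(edge.entities)
    answer = ents[masked_pos]
    ents[masked_pos] = "?"
    return HyperEdge(edge.relation, tuple(ents)), answer

fact = HyperEdge("flight", ("Paris", "Tokyo", "AirlineX"))   # hypothetical ternary fact
query, answer = make_query(fact, masked_pos=1)
print(query.entities, answer)      # ('Paris', '?', 'AirlineX') Tokyo
print(fact.positional_pairs())     # [('Paris', 0), ('Tokyo', 1), ('AirlineX', 2)]
```

Because the representation carries positions rather than a fixed relation vocabulary, the same code handles a binary edge and a five-ary edge without changes.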

2506.11512 2026-05-11 cs.LG cs.AI

From Time Series Analysis to Question Answering: A Survey in the LLM Era

Wei Li, Zhe Xie, Yuxuan Liang, Xinli Hao, Yunyao Cheng, Dan Pei, Xiaofeng Meng

Affiliations: Renmin University of China; Tsinghua University; Hong Kong University of Science and Technology (Guangzhou); Aalborg University

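AI summary: This survey reviews the shift from time series analysis to time series question answering (TSQA) in the LLM era. It offers guidance on choosing alignment paradigms that are flexible, economical, and generalizable, and outlines future research directions.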

Comments Accepted by IJCAI 2026 Survey Track

Abstract

Recently, Large Language Models (LLMs) have introduced a novel paradigm in Time Series Analysis (TSA), leveraging strong language capabilities to support tasks such as forecasting and anomaly detection. However, these analysis tasks cannot adequately cover temporal language tasks, such as interpretation and captioning. A fundamental gap remains between TSA and LLMs: LLMs are pre-trained to optimize natural language relevance for question answering rather than objectives specialized for TSA. To bridge this gap, TSA is evolving toward Time Series Question Answering (TSQA), shifting from expert-driven and task-specific analysis to user-driven and task-unified question answering. TSQA depends on flexible exploration rather than predefined TSA pipelines. In this survey, we first propose a taxonomy that reflects the evolution from TSA to TSQA, driven by a shift from external to internal alignment. We then organize existing literature into three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment, and provide practical guidance for flexible, economical, and generalizable selection of alignment paradigms. We finally analyze datasets across domains and characteristics, identify challenges, and highlight future research directions.

2506.05668 2026-05-11 cs.LG stat.ML

RNE: plug-and-play diffusion inference-time control and energy-based training

Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas

Affiliations: University of Cambridge; Cornell University; Xaira Therapeutics

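AI summary: The paper proposes RNE, which uses the density ratio between path distributions to connect marginal densities with transition kernels. This unifies diffusion density estimation, inference-time control, and energy-based diffusion training. Experiments show strong results for inference-time control and for training energy-based diffusion models.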

Comments Accepted at ICLR 2026

Abstract

Diffusion models generate data by removing noise gradually, which corresponds to the time-reversal of a noising process. However, access to only the denoising kernels is often insufficient. In many applications, we need knowledge of the marginal densities along the generation trajectory, which enables tasks such as inference-time control. To address this gap, we introduce the Radon-Nikodym Estimator (RNE). Based on the concept of the density ratio between path distributions, it reveals a fundamental connection between marginal densities and transition kernels, providing a flexible plug-and-play framework that unifies (1) diffusion density estimation, (2) inference-time control, and (3) energy-based diffusion training under a single perspective. Experiments demonstrate that RNE delivers strong results in inference-time control applications, such as annealing and model composition, with promising inference-time scaling performance, and achieves a simple yet efficient regularisation for training energy-based diffusion models. Additionally, our proposed RNE is modality-agnostic and applicable not only to continuous diffusion models but also to their discrete diffusion counterparts.
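The link between marginal densities and transition kernels can be seen in a small discrete example. Take a forward (noising) kernel $q(x_{t+1}\mid x_t)$ and its exact time reversal $p(x_t\mid x_{t+1})$. For any pair with nonzero probability, Bayes' rule gives

$$p_{t+1}(x_{t+1}) = \frac{p_t(x_t)\,q(x_{t+1}\mid x_t)}{p(x_t\mid x_{t+1})},$$

so marginals can be propagated using kernel ratios alone. The sketch verifies this identity on a random three-state chain. It illustrates the kind of density-ratio relation the abstract refers to. It is not the paper's RNE estimator, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 3
p_t = rng.dirichlet(np.ones(n_states))                # marginal at time t
q = rng.dirichlet(np.ones(n_states), size=n_states)   # q[i, j] = q(x_{t+1}=j | x_t=i)
p_next = p_t @ q                                      # marginal at time t+1

# Exact reverse kernel by Bayes' rule: rev[j, i] = p(x_t=i | x_{t+1}=j)
rev = (p_t[:, None] * q).T / p_next[:, None]

# Recover log p_{t+1}(j) from any starting state i using only the kernel ratio
for i in range(n_states):
    for j in range(n_states):
        est = np.log(p_t[i]) + np.log(q[i, j]) - np.log(rev[j, i])
        assert np.isclose(est, np.log(p_next[j]))
print("identity holds for all state pairs")
```

The same ratio applied along a whole trajectory yields a path-level density ratio. The discrete setting here also echoes the abstract's note that the idea carries over to discrete diffusion.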

2506.00886 2026-05-11 cs.AI

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, Kam-Fai Wong

Affiliations: University of Edinburgh; University of Illinois Urbana-Champaign; Northwestern University; Princeton University; The Chinese University of Hong Kong

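AI summary: The paper asks when an agent should invoke external tools and argues that it should do so only when internal reasoning cannot reliably complete the task. It introduces the Theory of Agent (ToA) framework, which centers on managing uncertainty in sequential decisions. It also argues that unnecessary delegation reduces efficiency and hinders the development of internal reasoning.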

Abstract

As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent's internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.
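As a toy illustration of the "resolve internally or delegate externally" decision, this sketch estimates the agent's internal uncertainty from how well several sampled answers agree (self-consistency). It delegates to a tool only when the entropy of that answer distribution exceeds a threshold. The entropy criterion, the threshold value, and the function names are our own assumptions. The position paper states the principle, not this particular rule.

```python
import math
from collections import Counter

def answer_entropy(samples: list[str]) -> float:
    """Shannon entropy (nats) of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def should_invoke_tool(samples: list[str], threshold: float = 0.5) -> bool:
    """Hypothetical rule: delegate externally only if internal answers disagree too much."""
    return answer_entropy(samples) > threshold

print(should_invoke_tool(["42"] * 5))                      # False: internal reasoning agrees
print(should_invoke_tool(["42", "17", "9", "42", "3"]))    # True: residual uncertainty is high
```

A miscalibrated threshold reproduces both failure modes the abstract names. Set it too high and the agent overthinks without tools; set it too low and it overacts by delegating needlessly.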

2505.13741 2026-05-11 cs.CV cs.NE

Frozen Backpropagation: Relaxing Weight Symmetry in Deep Spiking Neural Networks

Gaspard Goupy, Pierre Tirilly, Ioan Marius Bilasco

Affiliations: Nokia Bell Labs; Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL

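AI summary: The paper proposes Frozen Backpropagation (fBP), which relaxes the weight-symmetry requirement in deep spiking neural networks by reducing weight transport. This lowers energy costs while keeping accuracy comparable to backpropagation.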

Abstract

Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware can greatly reduce energy costs compared to GPU-based training. However, implementing Backpropagation (BP) on such hardware is challenging because forward and backward passes are typically performed by separate networks with distinct weights. To compute correct gradients, forward and feedback weights must remain symmetric during training, necessitating weight transport between the two networks. This symmetry requirement imposes hardware overhead and increases energy costs. To address this issue, we introduce Frozen Backpropagation (fBP), a BP-based training algorithm relaxing weight symmetry in settings with separate networks. fBP updates forward weights by computing gradients with periodically frozen feedback weights, reducing weight transports during training and minimizing synchronization overhead. To further improve transport efficiency, we propose three partial weight transport schemes of varying computational complexity, where only a subset of weights is transported at a time. We evaluate our methods on image recognition tasks using both temporally and rate-coded SNNs, and compare them to existing approaches addressing the weight symmetry requirement. Our results show that fBP outperforms these methods and achieves accuracy comparable to BP while significantly lowering transport costs. With partial weight transport, fBP can further lower those costs by up to 10,000× at the expense of moderate accuracy loss. This work provides insights for guiding the design of neuromorphic hardware incorporating BP-based on-chip learning.
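The core fBP idea is to compute forward-weight updates through a feedback matrix that is refreshed only periodically. It can be sketched on a small non-spiking two-layer network in NumPy:

- The feedback copy `b2` replaces `w2.T` in the backward pass.
- `b2` is re-synchronized every `transport_every` steps.
- With `transport_fraction < 1`, only a random subset of entries is copied at each transport. This is a simplified stand-in for the paper's partial weight transport schemes.

Network sizes, the learning rate, and the linear-teacher regression task are illustrative assumptions, and no spiking dynamics are modeled.

```python
import numpy as np

def train_fbp(steps=1000, lr=0.05, transport_every=10, transport_fraction=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(64, 8))
    y = x @ rng.normal(size=(8, 2))                       # linear teacher targets
    w1 = rng.normal(scale=1 / np.sqrt(8), size=(8, 16))   # forward weights, layer 1
    w2 = rng.normal(scale=0.1, size=(16, 2))              # forward weights, layer 2
    b2 = w2.copy()                                        # frozen feedback weights
    losses = []
    for step in range(steps):
        h = np.tanh(x @ w1)
        err = h @ w2 - y
        losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))
        d_out = err / len(x)
        grad_w2 = h.T @ d_out
        # Backward pass uses the frozen feedback b2 instead of the live w2
        d_h = (d_out @ b2.T) * (1 - h ** 2)
        grad_w1 = x.T @ d_h
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
        if (step + 1) % transport_every == 0:
            # Weight transport: copy all (or a random subset of) forward weights
            mask = rng.random(w2.shape) < transport_fraction
            b2[mask] = w2[mask]
    return losses

full = train_fbp()
partial = train_fbp(transport_fraction=0.25)
print(f"full transport:    {full[0]:.3f} -> {full[-1]:.3f}")
print(f"partial transport: {partial[0]:.3f} -> {partial[-1]:.3f}")
```

Raising `transport_every` or lowering `transport_fraction` reduces the number of weight copies. This mirrors the transport-cost versus accuracy trade-off that the abstract reports.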

2504.16559 2026-05-11 cs.LG q-bio.QM

Synergistic Benefits of Joint Molecule Generation and Property Prediction

Adam Izdebski, Jan Olszewski, Pankhil Gawade, Krzysztof Koras, Serra Korkmaz, Valentin Rauscher, Jakub M. Tomczak, Ewa Szczurek

Affiliations: Institute of AI for Health, Helmholtz Zentrum München; Technical University of Munich; School of Computation, Information and Technology; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw; Ardigen SA; Eindhoven University of Technology

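AI summary: The paper proposes Hyformer, which uses an alternating attention mechanism and a joint pre-training scheme to optimize molecule generation and property prediction together. This improves conditional sampling, out-of-distribution property prediction, and representation learning.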

Comments 17 pages, 4 figures

Journal ref Transactions on Machine Learning Research (TMLR), 2026

Abstract

Modeling the joint distribution of data samples and their properties makes it possible to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial peptides.
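One common way for a single transformer to serve both generation and prediction is to switch its attention mask. A causal (left-to-right) mask supports autoregressive generation, and a bidirectional mask supports property prediction. This sketch builds both masks and alternates between them across training steps. Reading the abstract's "alternating attention mechanism" this way is our assumption, and so is the step schedule.

```python
import numpy as np

def attention_mask(seq_len: int, mode: str) -> np.ndarray:
    """Boolean mask: entry [i, j] is True if query position i may attend to key position j."""
    if mode == "generate":      # causal: token i sees tokens 0..i
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if mode == "predict":       # bidirectional: every token sees the whole sequence
        return np.ones((seq_len, seq_len), dtype=bool)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical schedule: alternate generation and prediction steps
for step in range(4):
    mode = "generate" if step % 2 == 0 else "predict"
    print(step, mode, attention_mask(3, mode).sum())   # 6 allowed pairs (causal) vs 9 (bidirectional)
```

Sharing all weights while changing only the mask lets both tasks shape the same representation. This is one plausible source of the synergy the abstract describes.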

2503.14998 2026-05-11 cs.CV

Tables Guide Vision: Learning to See the Heart through Tabular Data

Marta Hasny, Maxime Di Folco, Keno Bressem, Julia Schnabel

Affiliations: School of Computation, Information and Technology, Technical University of Munich; Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich; School of Biomedical Engineering and Imaging Sciences, King's College London; TUM University Hospital, Technical University of Munich

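AI summary: The paper proposes a tabular-guided contrastive learning framework that uses clinically relevant tabular data to identify patient-level similarities and build more meaningful pairs. This improves semantically aligned representation learning, and the approach is also evaluated with zero-shot prediction.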

Abstract

Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially critical in medical imaging domains such as cardiology, where demographic and clinical attributes play a key role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. Further, we show that our method can generalize to natural images by evaluating it on a car advertisement dataset. Code is available at https://github.com/marteczkah/tables_guide_vision.
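This minimal sketch illustrates the abstract's two ideas under our own assumptions:

- Positive pairs: each patient's positive is their nearest neighbour in standardized tabular-feature space, rather than an augmented view of the same image.
- Zero-shot prediction: a query gets the majority label among its k nearest image embeddings from a labelled reference set.

Euclidean distance, standardization, cosine similarity, and majority voting are illustrative choices. The paper's exact similarity measure and k-NN adaptation are not given in the abstract, and the toy numbers are made up.

```python
import numpy as np

def tabular_positive_pairs(tab: np.ndarray) -> np.ndarray:
    """For each patient i, the index of the most similar other patient by tabular data."""
    z = (tab - tab.mean(axis=0)) / (tab.std(axis=0) + 1e-8)   # standardize each attribute
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                                 # a patient is not its own positive
    return d.argmin(axis=1)

def knn_predict(query_emb: np.ndarray, ref_emb: np.ndarray, ref_labels: np.ndarray,
                k: int = 3) -> np.ndarray:
    """Majority vote over the k reference embeddings with highest cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    nearest = np.argsort(-(q @ r.T), axis=1)[:, :k]
    return np.array([np.bincount(ref_labels[row]).argmax() for row in nearest])

tab = np.array([[60, 1.0], [61, 1.1], [30, 3.0], [31, 2.9]])   # hypothetical (age, biomarker)
print(tabular_positive_pairs(tab))                             # [1 0 3 2]

ref = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(knn_predict(np.array([[1.0, 0.05], [0.05, 1.0]]), ref, labels, k=2))   # [0 1]
```

Choosing positives by tabular similarity means clinically similar patients are pulled together instead of being treated as negatives. This is the false-negative problem the abstract highlights.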

2503.12285 2026-05-11 cs.LG cs.AI cs.GT cs.SY eess.SY stat.ML

A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback

Vaneet Aggarwal, Shweta Jain, Subham Pokhriyal, Christopher John Quinn

Affiliations: Purdue University; IIT Ropar; Iowa State University

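AI summary: The paper proposes a resilience framework for bi-criteria combinatorial optimization that converts resilient offline algorithms into online algorithms under bandit feedback. The resulting algorithms achieve sublinear regret and sublinear cumulative constraint violation.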

Journal ref Transactions on Machine Learning Research, May 2026

Abstract

We study bi-criteria combinatorial optimization under noisy function evaluations. While resilience and black-box offline-to-online reductions have been studied in single-objective settings, extending these ideas to bi-criteria problems introduces new challenges due to the coupled degradation of approximation guarantees for objectives and constraints. We introduce a notion of $(\alpha,\beta,\delta,\texttt{N})$-resilience for bi-criteria approximation algorithms, capturing how joint approximation guarantees degrade under bounded (possibly worst-case) oracle noise, and develop a general black-box framework that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits with bandit feedback. The resulting online guarantees achieve sublinear regret and cumulative constraint violation of order $\tilde{O}(\delta^{2/3}\texttt{N}^{1/3}T^{2/3})$ without requiring structural assumptions such as linearity, submodularity, or semi-bandit feedback on the noisy functions. We demonstrate the applicability of the framework by establishing resilience for several classical greedy algorithms in submodular optimization.
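The offline-to-online conversion can be illustrated with a generic explore-then-commit wrapper, the style of reduction that typically produces $T^{2/3}$-type rates. During exploration, each value query made by the offline algorithm is answered by averaging `m` noisy bandit evaluations, each costing one round. The resulting set is then played for the remaining rounds.

This is a single-objective simplification, with cardinality-constrained greedy as the offline algorithm and bounded uniform noise. The paper's bi-criteria framework also tracks a constraint function, and its choice of `m` is not reproduced here. All names and the toy coverage instance are hypothetical.

```python
import numpy as np

def greedy(value, ground_set, k):
    """Offline greedy for monotone submodular maximization with |S| <= k."""
    chosen = []
    for _ in range(k):
        base = value(chosen)
        remaining = [e for e in ground_set if e not in chosen]
        gains = {e: value(chosen + [e]) - base for e in remaining}
        chosen.append(max(gains, key=gains.get))
    return chosen

def explore_then_commit(noisy_value, ground_set, k, horizon, m):
    """Run greedy on averaged noisy queries, then commit to its output set."""
    rounds_used = 0

    def averaged_value(s):
        nonlocal rounds_used
        rounds_used += m                      # each noisy sample consumes one bandit round
        return float(np.mean([noisy_value(s) for _ in range(m)]))

    solution = greedy(averaged_value, ground_set, k)   # exploration phase
    commit_rounds = max(horizon - rounds_used, 0)       # exploitation: replay the solution
    return solution, rounds_used, commit_rounds

sets = {0: {1, 2, 3, 4}, 1: {1, 2}, 2: {5, 6, 7}, 3: {4, 8}}

def coverage(s):
    return len(set().union(*(sets[e] for e in s)))

rng = np.random.default_rng(0)

def noisy_coverage(s):
    return coverage(s) + rng.uniform(-0.1, 0.1)   # bounded oracle noise

result = explore_then_commit(noisy_coverage, list(sets), k=2, horizon=10_000, m=50)
print(result)   # ([0, 2], 450, 9550)
```

Larger `m` makes the offline algorithm's view less noisy but spends more rounds exploring. Balancing that trade-off against the horizon is where rates like $T^{2/3}$ come from.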

2502.07143 2026-05-11 cs.CL

Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, Junde Wu

Affiliations: University of Oxford; Technical University of Munich

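AI summary: The paper proposes APP, an LLM-based medical assistant that uses grounded reasoning and transparent diagnosis to make medical dialogue more human-centric. It improves diagnostic accuracy and increases user engagement.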

Abstract

The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.
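Bayesian active learning for diagnosis is often implemented by asking the question whose answer is expected to reduce uncertainty over candidate conditions the most. This sketch works as follows:

1. Keep a posterior over a few hypothetical conditions.
2. Pick the yes/no symptom question with the largest expected information gain.
3. Update the posterior with Bayes' rule.

The conditions, likelihood table, and function names are invented for illustration. The abstract does not specify APP's exact formulation.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def update(prior: np.ndarray, likelihood_yes: np.ndarray, answer_yes: bool) -> np.ndarray:
    """Posterior over conditions after a yes/no answer to one symptom question."""
    like = likelihood_yes if answer_yes else 1 - likelihood_yes
    post = prior * like
    return post / post.sum()

def expected_info_gain(prior: np.ndarray, likelihood_yes: np.ndarray) -> float:
    """Prior entropy minus expected posterior entropy for one question."""
    p_yes = float(prior @ likelihood_yes)
    h_after = (p_yes * entropy(update(prior, likelihood_yes, True))
               + (1 - p_yes) * entropy(update(prior, likelihood_yes, False)))
    return entropy(prior) - h_after

# Hypothetical P(symptom present | condition) for three conditions
conditions = ["flu", "cold", "allergy"]
likelihoods = {
    "fever?":      np.array([0.9, 0.3, 0.05]),
    "itchy eyes?": np.array([0.1, 0.1, 0.8]),
    "runny nose?": np.array([0.6, 0.7, 0.7]),
}
prior = np.ones(3) / 3
best_q = max(likelihoods, key=lambda q: expected_info_gain(prior, likelihoods[q]))
print(best_q)   # fever?
posterior = update(prior, likelihoods[best_q], answer_yes=True)
print({c: round(float(p), 3) for c, p in zip(conditions, posterior)})
```

Keeping an explicit posterior is also what makes the diagnosis transparent. At any turn the assistant can report how probable each condition is and why a question was asked.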

2501.09209 2026-05-11 cs.CV

Surgical Visual Understanding (SurgVU) Dataset

Aneeq Zia, Max Berniker, Rogerio Nespolo, Xiaorui Zhang, Conor Perreault, Ziheng Wang, Benjamin Mueller, Ryan Schmidt, Kiran Bhattacharyya, Xi Liu, Anthony Jarc

Affiliations: Intuitive Surgical

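AI summary: The paper presents a large dataset of surgical videos with labels for foundational research in surgical data science. It describes how the data were collected and what makes them distinctive, and outlines several example problems, with the aim of drawing the machine learning community into surgical data science.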

Abstract

Owing to recent advances in machine learning and the ability to harvest large amounts of data during robotic-assisted surgeries, surgical data science is ripe for foundational work. We present a large dataset of surgical videos and their accompanying labels for this purpose. We describe how the data was collected and some of its unique attributes. Multiple example problems are outlined. Although the dataset was curated for a particular set of scientific challenges (in an accompanying paper), it is general enough to be used for a broad range of machine learning questions. Our hope is that this dataset exposes the larger machine learning community to the challenging problems within surgical data science and becomes a touchstone for future research. The videos are available at https://storage.googleapis.com/isi-surgvu/surgvu24_videos_only.zip, the labels at https://storage.googleapis.com/isi-surgvu/surgvu24_labels_updated_v2.zip, a validation set for the tool detection problem at https://storage.googleapis.com/isi-surgvu/cat1_test_set_public.zip, and a sample set of question-answer pairs for surgical visual question answering at https://storage.googleapis.com/isi-surgvu/SURGVU25_cat_2_sample_set_public.zip.

2410.06355 2026-05-11 cs.RO cs.AI

UNCOM: Zero-shot Context-Aware Command Understanding for Tabletop Scenarios

Antonio Galiza Cerdeira Gonzalez, Paweł Gajewski, Bipin Indurkhya

Affiliations: Jagiellonian University; AGH University of Krakow

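AI summary: UNCOM is a hybrid framework that interprets natural human commands in tabletop scenarios by integrating speech, gestures, and scene context. It needs no predefined object models or task-specific training data, and its evaluation demonstrates robustness.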

Abstract

This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information (speech, gestures, and scene context) to extract structured, actionable instructions for robots. Addressing the need for general-purpose human-robot interaction in domestic environments, UNCOM is designed for zero-shot operation, without reliance on predefined object models or training data specific to a given task. Using foundational and task-specific deep learning models, it allows out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object-action-target representations, enabling integration with symbolic robotic frameworks. We demonstrate the system on a TIAGo++ robot and evaluate it on a real-world dataset of human-robot interaction scenarios, achieving an 82.39\% success rate on our benchmark dataset, which highlights the robustness of the system to diversity, noise, and communication ambiguity. The dataset, evaluation scenarios, and code are publicly available to support future research.
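The object-action-target representation can be pictured as a small structured record filled from several modalities. In this hypothetical sketch, the speech has already been parsed upstream into an action word, an object word, and a target. A deictic word such as "that" is then resolved to the scene object nearest the pointing ray reported by a gesture detector. The `Command` class, its field names, and the ray-distance rule are our assumptions, not UNCOM's actual interfaces.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Command:
    action: str
    obj: str
    target: str

def closest_to_ray(origin: np.ndarray, direction: np.ndarray, objects: dict[str, np.ndarray]) -> str:
    """Name of the object whose centroid is nearest to the pointing ray."""
    d = direction / np.linalg.norm(direction)

    def dist(p: np.ndarray) -> float:
        v = p - origin
        t = max(float(v @ d), 0.0)          # ignore points behind the hand
        return float(np.linalg.norm(v - t * d))

    return min(objects, key=lambda name: dist(objects[name]))

def resolve(action: str, obj_word: str, target: str, gesture, objects) -> Command:
    """Replace a deictic object word with the object indicated by the gesture."""
    if obj_word in {"this", "that", "it"}:
        obj_word = closest_to_ray(*gesture, objects)
    return Command(action, obj_word, target)

scene = {"cup": np.array([1.0, 0.0, 0.0]), "apple": np.array([0.0, 1.0, 0.0])}
pointing = (np.zeros(3), np.array([0.0, 1.0, 0.1]))   # hand position, pointing direction
print(resolve("put", "that", "bowl", pointing, scene))
# Command(action='put', obj='apple', target='bowl')
```

Because the output is a plain structured record, a symbolic planner can consume it directly. This is the integration benefit the abstract attributes to the explicit representation.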

2410.06347 2026-05-11 cs.RO cs.AI

Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning

Paweł Gajewski, Dominik Żurek, Marcin Pietroń, Kamil Faber

Affiliations: Faculty of Computer Science, AGH University of Krakow

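AI summary: The paper proposes a Goal-Conditioned Decision Transformer for multi-goal offline reinforcement learning that explicitly incorporates goal states into the sequence-modeling framework. It solves varying tasks using only pre-collected data. Validated on the Franka Emika Panda platform, it outperforms state-of-the-art online baselines.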

Abstract

Reinforcement learning (RL) in robotics faces significant hurdles regarding sample efficiency and generalization across varying goals. While Offline RL mitigates the need for costly online interactions, its integration with goal-conditioned policies and transformer-based architectures remains underexplored. We introduce a Goal-Conditioned Decision Transformer adapted for offline multi-goal robotics. By explicitly incorporating goal states into the sequence modeling framework, our approach efficiently solves varying tasks using only pre-collected data. We validate this method on a newly released offline dataset for the Franka Emika Panda platform. Experimental results demonstrate that our approach outperforms state-of-the-art online baselines in complex tasks and maintains robustness in sparse-reward settings, even with limited expert demonstrations.
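In a Decision Transformer, each timestep contributes several tokens that the model attends over autoregressively. This sketch assembles a goal-conditioned sequence by prepending the goal state to every (return-to-go, state, action) triple of a trajectory. The per-timestep token layout, the toy dimensions, and the helper names are illustrative assumptions. The abstract says only that goal states are explicitly incorporated into the sequence modeling framework.

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """R_t = sum of rewards from step t to the end of the trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(goal, states, actions, rewards):
    """Interleave tokens per timestep as [goal, return-to-go, state, action]."""
    rtg = returns_to_go(rewards)
    tokens = []
    for t in range(len(states)):
        tokens += [("goal", goal), ("rtg", rtg[t]), ("state", states[t]), ("action", actions[t])]
    return tokens

# Hypothetical 3-step trajectory with a sparse reward at the final step
goal = np.array([0.5, 0.2, 0.1])      # e.g. a target end-effector position (assumption)
states = np.zeros((3, 3))
actions = np.zeros((3, 2))
rewards = np.array([0.0, 0.0, 1.0])
seq = build_sequence(goal, states, actions, rewards)
print(len(seq), [name for name, _ in seq[:4]], returns_to_go(rewards))
```

Conditioning every step on the goal lets one trained model pursue different targets at inference simply by changing `goal`. This is the multi-goal reuse the abstract describes, with no online interaction needed.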

2408.07522 2026-05-11 cs.SD cs.LG eess.AS

Optimising MFCC parameters for the automatic detection of respiratory diseases

Yuyang Yan, Sami O. Simons, Loes van Bemmel, Lauren Reinders, Frits M. E. Franssen, Visara Urovi

Affiliations: Institute of Data Science, Maastricht University; Department of Respiratory Medicine, NUTRIM Research Institute of Nutrition and Translational Research in Metabolism, Faculty of Health, Medicine and Life Sciences, Maastricht University; Department of Respiratory Medicine, Maastricht University Medical Centre; Department of Research and Development, Ciro

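AI summary: The paper studies how MFCC extraction parameters affect respiratory-disease detection on four datasets. Accuracy drops as hop length increases, about 30 coefficients works best, and the effect of frame length differs across datasets. Optimizing these parameters substantially improves SVM accuracy.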

Abstract

Voice signals originating from the respiratory tract are used as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the acoustic features employed, Mel Frequency Cepstral Coefficients (MFCCs) are widely used for automatic analysis, and MFCC extraction commonly relies on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC-based classification decreases as hop length increases, and that the optimal number of coefficients is approximately 30. The effect of frame length varies across datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial gains in accuracy: the SVM model achieves accuracies of 81.1%, 80.6%, and 71.7% on the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively, improvements of 19.6%, 16.10%, and 14.90% over the worst combination.
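This compact NumPy/SciPy MFCC extractor makes the three studied parameters concrete. `n_mfcc`, `frame_ms`, and `hop_ms` correspond directly to the number of coefficients, the frame length, and the hop length. It follows the textbook pipeline: framing, Hamming window, power spectrum, triangular mel filterbank, log, and DCT-II.

Several settings are our assumptions and not the authors' configuration: the 16 kHz sample rate, the power-of-two FFT size, the 40 mel bands, and the use of this pipeline instead of the authors' toolkit. `n_mels` must be at least `n_mfcc`.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, n_mfcc=30, frame_ms=25.0, hop_ms=10.0, n_mels=40):
    """Return an (n_frames, n_mfcc) MFCC matrix."""
    frame_len = int(sr * frame_ms / 1000)          # frame length in samples
    hop_len = int(sr * hop_ms / 1000)              # hop length in samples
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    n_fft = int(2 ** np.ceil(np.log2(frame_len)))  # next power of two
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)    # 1 s synthetic tone standing in for a voice recording
feats = mfcc(x, sr=sr, n_mfcc=30, frame_ms=25, hop_ms=10)
print(feats.shape)                 # (98, 30)
```

For a fixed recording, a longer hop directly reduces the number of frames. With `hop_ms=50`, the same second of audio yields only 20 frames, which is one intuition for why larger hop lengths can cost accuracy.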