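arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.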

2411.05729 2026-06-08 cs.LG stat.ML

Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data

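William Cappelletti, Pascal Frossard

Affiliations: LTS4, EPFL, Lausanne, Switzerland

AI summary: This paper proposes a graph-dictionary signal model that describes relationships in multivariate data through graphs, represents them as sparse combinations of graph atoms recovered from observed signals, and outperforms existing baselines.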

DOI: 10.1109/TSIPN.2026.3653623

Abstract

Representing and exploiting multivariate signals requires capturing relations between variables, which we can represent by graphs. Graph dictionaries make it possible to describe complex relational information as a sparse sum of simpler structures, but no prior model exists to infer such underlying structure elements from data. We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in the data distribution as filters on the weighted sum of their Laplacians. We propose a framework to infer the graph dictionary representation from observed node signals, which allows including a priori knowledge about signal properties and about the underlying graphs and their coefficients. We introduce a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem. We show the capability of our method to reconstruct graphs from signals in multiple synthetic settings, where our model outperforms popular baselines. Then, we exploit graph-dictionary representations in an illustrative motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods relying on many more features. Our graph-dictionary model bridges a gap between sparse representations of multivariate data and a structured decomposition of sample-varying relationships into a sparse combination of elementary graph atoms.
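
The idea of a per-sample graph built as a sparse weighted sum of atom Laplacians can be illustrated with a small sketch. It assumes combinatorial Laplacians and uses the quadratic form x^T L x as a simple smoothness measure; the two atoms, the coefficients, and all function names are hypothetical and chosen only for illustration, not taken from the paper.

```python
def laplacian(n, edges):
    """Combinatorial Laplacian L = D - W of an undirected weighted graph."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    return lap


def combine(atoms, coeffs):
    """Weighted sum of atom Laplacians: L = sum_k coeffs[k] * atoms[k]."""
    n = len(atoms[0])
    return [
        [sum(c * a[i][j] for a, c in zip(atoms, coeffs)) for j in range(n)]
        for i in range(n)
    ]


def smoothness(lap, x):
    """Quadratic form x^T L x; small values mean x varies little along edges."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


# Two hypothetical graph atoms on four nodes.
atom_a = laplacian(4, [(0, 1, 1.0), (1, 2, 1.0)])
atom_b = laplacian(4, [(2, 3, 1.0)])

signal = [1.0, 1.0, 1.0, -1.0]

# A sparse coefficient vector activates only atom_a for this sample.
energy_a = smoothness(combine([atom_a, atom_b], [1.0, 0.0]), signal)
# Activating only atom_b penalizes the jump between nodes 2 and 3.
energy_b = smoothness(combine([atom_a, atom_b], [0.0, 1.0]), signal)
```

In this toy case the signal is constant along the edges of the first atom and changes sign across the single edge of the second, so the first coefficient vector explains the sample with lower smoothness energy.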

2403.09110 2026-06-08 cs.LG cs.SY eess.SY math.DS math.OC

SINDy-RL: Interpretable and Efficient Model-Based Reinforcement Learning

Nicholas Zolman, Christian Lagemann, Urban Fasel, J. Nathan Kutz, Steven L. Brunton

Affiliations: Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA; Data Science and Artificial Intelligence Department, The Aerospace Corporation, El Segundo, CA 90245; Department of Aeronautics, Imperial College, London SW7 2AZ, United Kingdom; Department of Applied Mathematics, University of Washington, Seattle, WA 98195; Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195

AI summary: This paper proposes SINDy-RL, a framework that combines SINDy and DRL to obtain efficient, interpretable dynamics models and control policies in the low-data regime, validated on benchmark environments and flow control experiments.

Comments For code, see https://github.com/nzolman/sindy-rl. v2 Update: Included Pinball and 3D Airfoil examples. Christian Lagemann added as an author for contributions with the 3D Airfoil code. To appear in Nature Communications

Journal ref Nat. Commun. 16, 10714 (2025)

DOI: 10.1038/s41467-025-65738-4

Abstract

Deep reinforcement learning (DRL) has shown significant promise for uncovering sophisticated control policies that interact in complex environments, such as stabilizing a tokamak fusion reactor or minimizing the drag force on an object in a fluid flow. However, DRL requires an abundance of training examples and may become prohibitively expensive for many applications. In addition, the reliance on deep neural networks often results in an uninterpretable, black-box policy that may be too computationally expensive to use with certain embedded systems. Recent advances in sparse dictionary learning, such as the sparse identification of nonlinear dynamics (SINDy), have shown promise for creating efficient and interpretable data-driven models in the low-data regime. In this work we introduce SINDy-RL, a unifying framework for combining SINDy and DRL to create efficient, interpretable, and trustworthy representations of the dynamics model, reward function, and control policy. We demonstrate the effectiveness of our approaches on benchmark control environments and flow control problems, including gust mitigation on a 3D NACA 0012 airfoil at $Re=1000$. SINDy-RL achieves comparable performance to modern DRL algorithms using significantly fewer interactions in the environment and results in an interpretable control policy orders of magnitude smaller than a DRL policy.
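
The sparse-regression step at the core of SINDy can be sketched as follows. This is a minimal, dependency-free illustration of sequentially thresholded least squares on a toy system dx/dt = -2x. The candidate library [1, x, x^2], the threshold, and all function names are assumptions made for this example and are not taken from the SINDy-RL code.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(theta, y, active):
    """Least squares restricted to the active columns (normal equations)."""
    cols = [j for j, on in enumerate(active) if on]
    gram = [[sum(row[p] * row[q] for row in theta) for q in cols] for p in cols]
    rhs = [sum(row[p] * yi for row, yi in zip(theta, y)) for p in cols]
    coef = [0.0] * len(active)
    for j, value in zip(cols, solve_linear(gram, rhs)):
        coef[j] = value
    return coef


def stlsq(theta, y, threshold=0.1, iterations=10):
    """Sequentially thresholded least squares: fit, prune small terms, refit."""
    active = [True] * len(theta[0])
    coef = least_squares(theta, y, active)
    for _ in range(iterations):
        new_active = [on and abs(c) >= threshold for on, c in zip(active, coef)]
        if not any(new_active):
            return [0.0] * len(active)
        if new_active == active:
            break
        active = new_active
        coef = least_squares(theta, y, active)
    return coef


# Toy data generated from dx/dt = -2x, with candidate library [1, x, x^2].
xs = [0.1 * k for k in range(-10, 11)]
theta = [[1.0, x, x * x] for x in xs]
dxdt = [-2.0 * x for x in xs]
coef = stlsq(theta, dxdt)  # only the coefficient of x survives pruning
```

The surviving coefficients form a short symbolic model, which is the sense in which such models are interpretable and small compared with a neural network.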

2504.21614 2026-06-08 cs.CV

Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu

Affiliations: University of Michigan Transportation Research Institute; Karlsruhe Institute of Technology; Texas A&M University

AI summary: This paper presents the Mcity Data Engine, which uses open-vocabulary data selection to address the difficulty of detecting long-tail classes in large amounts of unlabeled data, and provides a complete data development workflow from data acquisition to model deployment.

Comments Accepted for publication at ITSC 2025

DOI: 10.1109/ITSC60802.2025.11423797

Abstract

With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine

2506.14634 2026-06-08 cs.CL cs.AI cs.CY

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

Affiliations: Social Data Science & AI Lab, LMU Munich; Munich Center for Machine Learning; University of Maryland, College Park; GESIS – Leibniz Institute for the Social Sciences

AI summary: This paper examines how well large language models can code open-ended survey responses. Using German data on reasons for survey participation, it compares several LLMs and prompting approaches and finds that only a fine-tuned LLM achieves satisfactory predictive performance, and that unequal classification performance across categories changes the resulting category distributions when fine-tuning is not used.

Comments to appear in Survey Research Methods

Journal ref Survey Research Methods (2025)

DOI: 10.18148/srm/2025.v19i4.8568

Abstract

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

2311.00212 2026-06-08 cs.LG cs.NA math.DG math.NA

A Unified Framework to Enforce, Discover, and Promote Symmetry in Machine Learning

Samuel E. Otto, Nicholas Zolman, J. Nathan Kutz, Steven L. Brunton

Affiliations: AI Institute in Dynamic Systems, University of Washington; Sibley School of Mechanical and Aerospace Engineering, Cornell University

AI summary: This paper proposes a unified framework for incorporating symmetry into machine learning models in three ways (enforcing known symmetries, discovering unknown symmetries, and promoting symmetry), unifying existing results within a mathematical framework based on the Lie derivative.

Journal ref J. Mach. Learn. Res. 26(248):1-83 (2025)

Abstract

Symmetry is present throughout nature and continues to play an increasingly central role in physics and machine learning. Fundamental symmetries, such as Poincaré invariance, allow physical laws discovered in laboratories on Earth to be extrapolated to the farthest reaches of the universe. Symmetry is essential to achieving this extrapolatory power in machine learning applications. For example, translation invariance in image classification allows models with fewer parameters, such as convolutional neural networks, to be trained on smaller data sets and achieve state-of-the-art performance. In this paper, we provide a unifying theoretical and methodological framework for incorporating symmetry into machine learning models in three ways: 1. enforcing known symmetry when training a model; 2. discovering unknown symmetries of a given model or data set; and 3. promoting symmetry during training by learning a model that breaks symmetries within a user-specified group of candidates when there is sufficient evidence in the data. We show that these tasks can be cast within a common mathematical framework whose central object is the Lie derivative associated with fiber-linear Lie group actions on vector bundles. We extend and unify several existing results by showing that enforcing and discovering symmetry are linear-algebraic tasks that are dual with respect to the bilinear structure of the Lie derivative. We also propose a novel way to promote symmetry by introducing a class of convex regularization functions based on the Lie derivative and nuclear norm relaxation to penalize symmetry breaking during training of machine learning models. We explain how these ideas can be applied to a wide range of machine learning models including basis function regression, dynamical systems discovery, neural networks, and neural operators acting on fields.
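
For intuition about why enforcing and discovering symmetry are both linear-algebraic, consider the simplest special case. This is an assumption made for illustration, not the paper's general vector-bundle setting: a differentiable function $F:\mathbb{R}^n\to\mathbb{R}^m$ and a matrix Lie group acting linearly on inputs, with $\xi$ a generator in its Lie algebra. Sign conventions for the Lie derivative vary between references.

```latex
% Invariance of F under the one-parameter subgroup generated by \xi means
%   F(e^{t\xi} x) = F(x)  for all t and all x.
% Differentiating at t = 0 gives the infinitesimal condition
\[
  (\mathcal{L}_{\xi} F)(x)
  \;=\; \left.\frac{d}{dt}\right|_{t=0} F\!\left(e^{t\xi} x\right)
  \;=\; DF(x)\,\xi\,x
  \;=\; 0 \qquad \text{for all } x .
\]
% The map (\xi, F) \mapsto DF(x)\,\xi\,x is linear in \xi for fixed F
% and linear in F for fixed \xi, i.e. it is bilinear.
```

With $\xi$ fixed, the condition is a linear constraint on $F$ (enforcing a known symmetry). With $F$ fixed, it is a linear constraint on $\xi$ (discovering symmetries as a nullspace). This is the bilinear structure the abstract refers to, shown here only in this restricted setting.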

2502.08903 2026-06-08 cs.RO cs.AI

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

Guoqin Tang, Qingxuan Jia, Zeyuan Huang, Gang Chen, Ning Ji, Zhipeng Yao

Affiliations: Tsinghua University

AI summary: This paper proposes a framework that combines a 2D prompt synthesis module with a small language model to improve robots' 3D scene understanding and task execution; experiments report a task success rate of 96.0%.

Journal ref Engineering Applications of Artificial Intelligence, vol. 164, p. 113268, 2026

DOI: 10.1016/j.engappai.2025.113268

Abstract

Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, and poor transferability and reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module, which maps 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework's effectiveness in improving 3D recognition, task planning, and robotic task execution.

2501.11592 2026-06-08 cs.LG cs.AI cs.CL

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; China Belt and Road Joint Lab on Measurement and Control Technology, Wuhan, China; School of Electric and Electrical Engineering, Chongqing University of Technology, Chongqing, China; Optics Valley Laboratory, Wuhan, China; School of Water Conservancy and Transportation, Zhengzhou University, Zhengzhou, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China

AI summary: This paper proposes coefficients learning (CL), a training-free ultra-small neural model for fast sparse reconstruction that inherits the generality and interpretability of traditional iterative methods while improving efficiency and accuracy.

DOI: 10.1109/TPAMI.2026.3680162

Abstract

Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-established theory that drives many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods like deep unfolding fail to replace them because pre-trained models exhibit poor generality beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods, and adding the new ability to incorporate prior knowledge. In CL, a signal of length $n$ needs a minimum of only $n$ trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves the structural similarity index by 292%, 98%, and 45% for sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.

2606.07463 2026-06-08 eess.SP cs.CE cs.LG New submission

Amortized Neural Optimization for Pre-Layout Signal Integrity Design Space Exploration using Differentiable Surrogates

Julian Withöft, Werner John, Emre Ecik, Ralf Brüning, Jürgen Götze

Affiliations: Information Processing Lab, Faculty for Electrical Engineering and Information Technology, TU Dortmund; Pyramide2525, Paderborn, Germany; EMC Technology Center Paderborn, Zuken GmbH, Paderborn, Germany

AI summary: Proposes amortized neural optimization (ANO), a framework that replaces iterative black-box optimization with differentiable neural-network surrogate models so that near-optimal design parameters are obtained in a single forward pass, with speedups of three to four orders of magnitude in scenarios such as DDR5 DFE and SerDes equalization.

Comments 16 pages, 20 figures, 8 tables

Abstract

Pre-layout design space exploration (DSE) for high-speed signal integrity (SI) analysis is often limited by the computational cost of simulations and iterative optimization algorithms within modern electronic design automation (EDA) workflows. While machine learning surrogate models accelerate the simulation step, optimizing designs still requires utilizing iterative black-box search methods. This iterative nature scales poorly, making multi-corner sweeps computationally expensive. As a solution, this paper proposes amortized neural optimization (ANO) for pre-layout SI design. ANO entirely eliminates iterative black-box inference by utilizing fully differentiable neural network surrogate models. ANO extracts analytical gradients from the surrogate to train a global optimization policy. Instead of solving the optimization problem repeatedly at inference, the optimization process is learned offline and therefore amortized. Once the ANO policy is trained, it maps different channel contexts directly to near-optimal design parameters in a single deterministic forward pass. The efficiency and accuracy of the ANO framework are demonstrated based on three complex SI design scenarios, including DDR5 decision feedback equalization (DFE), 9-dimensional SerDes Tx/Rx co-equalization, and DDR3 DQS differential pair routing to optimize eye diagram metrics under intra-pair skew constraints. By trading roughly 10% in optimality compared to instance-specific black-box algorithms, it realizes speedups of three to four orders of magnitude. For a large-scale 320,000-instance multi-corner SerDes sweep optimization, ANO collapses what would have taken days of computation using iterative search algorithms into a single batched forward pass that completes in milliseconds. This transforms computationally expensive SI optimization into real-time and interactive pre-layout DSE.
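
The amortization idea described above can be shown on a deliberately tiny example. In this sketch a closed-form quadratic cost stands in for the differentiable surrogate and a linear map stands in for the policy network; both are hypothetical stand-ins chosen so the result can be checked by hand, and nothing here reproduces the authors' models or SI scenarios.

```python
def surrogate_grad(x, c):
    """Analytic d/dx of the toy surrogate cost f(x, c) = (x - 2c)^2."""
    return 2.0 * (x - 2.0 * c)


def train_policy(contexts, lr=0.1, steps=500):
    """Train a linear policy x = w*c + b offline by descending the surrogate.

    The chain rule passes the surrogate's gradient with respect to x
    through the policy parameters, so no per-instance search is needed.
    """
    w, b = 0.0, 0.0
    n = len(contexts)
    for _ in range(steps):
        grad_w = sum(surrogate_grad(w * c + b, c) * c for c in contexts) / n
        grad_b = sum(surrogate_grad(w * c + b, c) for c in contexts) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Offline phase: learn the optimizer once over a set of training contexts.
contexts = [-1.0, -0.5, 0.0, 0.5, 1.0]
w, b = train_policy(contexts)


def policy(c):
    """Inference phase: one forward pass maps a context to a design value."""
    return w * c + b


# For this toy cost the true optimum is x* = 2c, so policy(0.3) is close to 0.6.
design = policy(0.3)
```

Once trained, evaluating the policy on a large batch of contexts costs one pass per context, which is the source of the speedup over running an iterative search for each instance.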

2606.07381 2026-06-08 eess.IV cs.AI cs.CV New submission

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, Simon K. Warfield

Affiliations: Computational Radiology Laboratory; Boston Children’s Hospital; Harvard Medical School

AI summary: This study synthesizes MRI data with FCD lesions using a conditional generative network and evaluates their realism and their effect on automated detection; synthetic data can reduce labeled-data needs by about 20%, but real data remain more effective.

DOI: 10.1111/jon.70137

Abstract

Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.

2606.07374 2026-06-08 eess.SP cs.CV New submission

Beyond Backscatter: InSAR coherence from detected SAR images

Francescopaolo Sica, Andrea Pulella, Michael Schmitt

Affiliations: Department of Aerospace Engineering, University of the Bundeswehr Munich; Microwaves and Radar Institute, German Aerospace Center (DLR)

AI summary: Proposes a deep learning framework that regresses coherence directly from detected SAR images without accurate coregistration, using a Residual U-Net to learn the relationship between backscatter magnitude and coherence; evaluations on several datasets show improved accuracy for high-resolution coherence regression and good generalization.

Comments 27 pages, 20 figures

Abstract

In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across different temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as ground range detected data, e.g., distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.
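
For context, coherence maps such as the training targets mentioned above are conventionally estimated from two coregistered complex (SLC) images with the sample coherence magnitude over a local window. A minimal sketch of that standard estimator follows; window handling is omitted, the sample values are made up, and this is not the authors' processing pipeline.

```python
import cmath
import math


def coherence(s1, s2):
    """Sample coherence magnitude of two complex pixel windows.

    gamma = |sum(s1 * conj(s2))| / sqrt(sum|s1|^2 * sum|s2|^2), in [0, 1].
    """
    num = sum(a * b.conjugate() for a, b in zip(s1, s2))
    den = math.sqrt(sum(abs(a) ** 2 for a in s1) * sum(abs(b) ** 2 for b in s2))
    return abs(num) / den if den > 0 else 0.0


# A window of complex pixels and a copy shifted by a constant phase:
# the two acquisitions are fully coherent.
master = [1 + 1j, 2 - 1j, 0.5j]
rotated = [a * cmath.exp(0.7j) for a in master]
gamma_high = coherence(master, rotated)

# Two windows whose cross-products cancel: coherence drops to zero.
gamma_low = coherence([1 + 0j, 1 + 0j], [1 + 0j, -1 + 0j])

# Detected (amplitude-only) data discard the phase used in the numerator.
detected = [abs(a) for a in master]
```

Because detected images keep only the magnitudes, the phase-dependent numerator cannot be formed from them directly, which is why a learned regression from backscatter to coherence is needed.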

2606.07259 2026-06-08 eess.AS cs.SD New submission

Assessing True Generalisability of Audio-Visual Speech Recognisers

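Zhaofeng Lin, Stavros Petridis, Maja Pantic, Naomi Harte

Affiliations: Trinity College Dublin; Imperial College London

AI summary: By constructing an evaluation set strictly matched to the LRS3 test set, the authors find that state-of-the-art audio-visual speech recognition models suffer an across-the-board performance collapse on unseen data, exposing limited generalisability; they also analyse the drivers of the degradation, lexical bias, and error patterns.

Comments Accepted to Interspeech 2026 Long paper track. 9 pages, 4 figures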

Abstract

Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, proving that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and surprisingly reveal that audio-visual performance even lags behind audio-only settings. We release our matched test set for future benchmarking.

2606.07063 2026-06-08 eess.IV cs.CV New submission

Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

Sonalika Singh, Jyotirindra Dandapat, Avishi Razdan, Kshipra V. Moghe, Puneet Gupta, Lalan Kumar

Affiliations: Department of Electrical Engineering, Indian Institute of Technology Delhi, India; Department of Computer Science and Engineering, Indian Institute of Technology Indore, India; Department of Psychology, COEP Technological University, India

AI summary: To address the neglect of cultural differences in dynamic facial expression recognition, the authors introduce GCC-FER, described as the first large-scale global cross-cultural dataset, and a culture-aware system, CA-FER, that mitigates cultural bias by adaptively recalibrating facial representations; experiments support its effectiveness.

Abstract

Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group and a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.

2606.06983 2026-06-08 eess.IV cs.AI cs.CV New submission

DaX: Learning General Pathology Representations Across Scales

DaX: 跨尺度的通用病理学表示学习

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao, Minfeng Xu

发表机构 * DAMO Academy, Alibaba Group（达摩院，阿里巴巴集团）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Hupan Lab（虎斑实验室）

AI总结提出病理视觉基础模型DaX，通过改进DINOv3自监督学习，结合连续放大训练、跨尺度组织视图等设计，在44个公开数据集的161项临床任务上取得最佳平均性能。

AI Chinese abstract

计算病理学需要能够跨不同临床终点迁移且对放大倍数、染色、扫描仪类型、切片制备和输入分辨率变化保持鲁棒的视觉表示。我们提出DaX，一个病理视觉基础模型，它将DINOv3风格的自监督学习适应到全切片组织病理学。DaX从自然图像DINOv3权重初始化，并融合了连续放大训练、跨尺度组织视图、方向无关和采集鲁棒的数据增强、多输入尺寸训练以及Gram锚定的密集一致性。这些设计旨在连接局部细胞形态与全局组织结构，同时稳定跨输入尺度的密集token级表示。我们进一步构建了一个WSI级基准，包含来自44个公共数据集的161项临床有意义任务，涵盖28,182名患者和34,394张切片，跨越四个临床领域和九个任务类别。所有模型在固定的患者级交叉验证协议下进行评估，并采用折叠级统计排名，从而实现可重复的比较，对分割依赖的变异性不敏感。在该基准上，DaX在任务中取得了最高的平均性能，并持续获得强大的任务级排名分数，其增益涵盖诊断病理学、生物标志物和分子谱分析、组织/标本背景以及风险、反应和预后。这些结果支持DaX作为计算病理学的可迁移视觉编码器，并为未来的病理基础模型提供了标准化的评估框架。项目页面：此https URL。

English abstract

Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.

2606.06907 2026-06-08 eess.AS cs.AI cs.SD New submission

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

SpectCount: 通过合成信号进行频谱时间计数改进大型音频语言模型

Seonuk Kim, Yonghyeon Jun, Ju Yeon Kang, Jimin Hong, Yoonhyeong Lee, Nam Soo Kim

Affiliations * Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea

AI summary: Targeting the weak spectrotemporal perception of large audio language models, proposes SpectCount, a data-efficient fine-tuning method that uses fully synthetic audio signals generated on the fly, requires no real audio or annotations, and markedly improves performance on multiple auditory benchmarks.

Comments 5 pages, 5 figures

2606.06847 2026-06-08 eess.IV cs.CV New submission

Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

SAR图像中飞机目标的物理驱动语义散射结构理解

Yifei Yin, Xiaogang Yu, Hao Shi, Liang Chen, Wei Li

Affiliations * School of Information and Electronics, Beijing Institute of Technology; National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing; Beijing Institute of Remote Sensing Information

AI summary: Addressing unstable scattering-center representations and missed weak-scattering components for aircraft targets in SAR images, proposes S3U-SAR, a physics-driven framework that defines semantic scattering keypoints and applies multi-dimensional physical priors to reconstruct a complete topological structure, achieving the best performance on the benchmark dataset.

AI Chinese abstract

合成孔径雷达（SAR）因其全天时、全天候观测能力，已成为目标解译不可或缺的手段。在SAR目标解译中，电磁散射信息提供了超越视觉纹理的物理基础线索，并被广泛用于目标解译。然而，现有方法仍以局部散射中心表示为主。这种无序且与部件无关的表示对飞机目标极不稳定。因此，物理存在的弱散射响应部件常被遗漏，导致重建的拓扑结构不完整。为解决这一局限，我们建立了语义散射结构理解作为SAR飞机解译的新范式。定义语义散射关键点以将局部电磁响应与物理上有意义的飞机部件关联，同时引入可见性感知属性以保留弱可观测但物理存在的部件。关键点进一步组织为稳定的语义散射结构。基于此，我们提出S3U-SAR，一个物理驱动框架，用于定位语义散射关键点并构建由多维物理先验（包括散射异质性、刚体拓扑、散斑不确定性）约束的完整表示。进一步引入置信门控联合监督策略以缓解优化冲突。我们构建了KP-SAR-Aircraft-1.0，首个用于语义散射结构理解的细粒度基准。大量实验表明，S3U-SAR相比基线取得了最佳性能。跨类别和跨数据集评估进一步验证了其鲁棒性和可迁移性。

English abstract

Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically present components with weak scattering responses are often missed, resulting in an incomplete reconstructed topology. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically present components. The keypoints are further organized into a stable semantic scattering structure. Building upon this, we propose S3U-SAR, a physics-driven framework that localizes semantic scattering keypoints and constructs the complete representation, constrained by multi-dimensional physical priors including scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.

2606.06837 2026-06-08 eess.AS cs.LG New submission

SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails

SEAM：面向面试防护栏的脚本化与自发语音的快捷方式感知实时检测

Vsevolod Kovalev, Pranay Manocha

Affiliations * Symbal AI; Princeton University

AI summary: Proposes SEAM, a framework combining uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone, which reaches 0.971 ROC-AUC with 8 s windows and exposes the shortcut-learning problem.

Comments Accepted to Interspeech 2026

AI Chinese abstract

脚本化与自发语音检测对面试防护栏具有吸引力，但基准性能可能因与语料库身份、信道条件和录音伪影相关的快捷方式（而非说话风格本身）而膨胀。我们提出SEAM，一个用于实时脚本化检测的快捷方式感知框架，结合了统一预处理、接缝感知采样、非语音增强和紧凑的DistilHuBERT骨干。使用8秒窗口，该模型在外部面试领域评估集上达到0.971 ± 0.004的ROC-AUC。移除快捷方式预防组件可改善内部留出指标，但急剧降低外部性能，表明存在快捷方式学习。训练后量化将模型占用减少至41.8MB，且外部性能损失很小。结果表明，鲁棒的实时脚本化检测不仅依赖于骨干网络，还依赖于快捷方式感知的数据设计和评估。我们发布代码和模型检查点。

English abstract

Scripted vs. spontaneous speech detection is appealing for interview guardrails, but benchmark performance can be inflated by shortcuts tied to corpus identity, channel conditions, and recording artifacts rather than speaking style itself. We present SEAM, a shortcut-aware framework for real-time scriptedness detection that combines uniform preprocessing, seam-aware sampling, non-speech augmentation, and a compact DistilHuBERT backbone. With 8 s windows, the model achieves 0.971 ± 0.004 ROC-AUC on an external interview-domain evaluation set. Removing the shortcut-prevention components improves internal held-out metrics but sharply reduces external performance, indicating shortcut learning. Post-training quantization reduces the model footprint to 41.8 MB with little loss in external performance. The results demonstrate that robust real-time scriptedness detection depends not only on the backbone, but also on shortcut-aware data design and evaluation. We release code and model checkpoints.

2606.06795 2026-06-08 eess.AS cs.SD New submission

BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation

BiEAR: 一种受人类听觉启发的自适应双耳前端，用于多说话人定位和距离估计

Hanyu Meng, Eliathamby Ambikairajah, Vidhyasaharan Sethu, Qiquan Zhang, Haizhou Li

Affiliations * The University of New South Wales; Tongyi Speech Lab, Alibaba Group; School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes BiEAR, a human auditory-inspired adaptive binaural front-end in which a neural controller dynamically adjusts the frequency selectivity of the filterbank, improving the accuracy and robustness of multi-speaker localisation and distance estimation.

Comments Accepted to INTERSPEECH 2026

AI Chinese abstract

我们提出BiEAR，一种受人类听觉启发的自适应双耳前端，用于多说话人定位和距离估计。受人类听觉中内侧橄榄耳蜗（MOC）反馈的启发，BiEAR使用神经控制器在推理过程中自适应调整双耳听觉滤波器组的频率选择性。这为双耳产生时频自适应表示，使模型能够响应变化的声学条件。我们在消声和真实房间环境中评估了BiEAR在多说话人定位和距离估计上的性能。结果表明，与常用的固定双耳前端相比，自适应前端提高了定位准确性以及对未见说话人和房间的鲁棒性。对学习到的滤波器自适应的可视化和分析表明，BiEAR随时间强调信息丰富的频带。这些发现表明，自适应的、受生物启发的双耳前端可以改善机器在复杂声学场景中的听觉鲁棒性。

English abstract

We present BiEAR, a human auditory-inspired adaptive binaural front-end for multi-speaker localisation and distance estimation. Inspired by medial olivocochlear (MOC) feedback in human hearing, BiEAR uses a neural controller to adaptively adjust the frequency selectivity of a binaural auditory filterbank during inference. This yields time-frequency adaptive representations for both ears, enabling the model to respond to changing acoustic conditions. We evaluate BiEAR on multi-speaker localisation and distance estimation in anechoic and real-room environments. Results show that the adaptive front-end improves localisation accuracy and robustness to unseen speakers and rooms compared with commonly used fixed binaural front-ends. Visualisation and analysis of learned filter adaptations show that BiEAR emphasises informative frequency bands over time. These findings suggest that adaptive, biologically inspired binaural front-ends can improve machine hearing robustness in complex acoustic scenes.

2606.06725 2026-06-08 eess.IV cs.CV New submission

Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

基于神经缩放定律的超声心动图心肌分割与灌注量化的计算最优网络设计

Clara Rodrigo González, Matthieu Toulemonde, Lasha Gvinianidze, Cameron A. B. Smith, Oscar Bates, Roxy Senior, Fu Siong Ng, Meng-Xing Tang

Affiliations * Department of Bioengineering, Imperial College London; National Heart and Lung Institute, Imperial College London; Guy’s and St. Thomas’ NHS Foundation Trust

AI summary: Applies neural scaling laws to predict myocardial segmentation performance and determine the optimal network size on the CAMUS and CEUS datasets, reaching state-of-the-art performance with a 240-fold reduction in parameters; automatic segmentation is equivalent to a senior cardiologist in myocardial perfusion quantification.

Comments 15 pages, 4 figures, 5 tables, journal

AI Chinese abstract

使用对比增强超声进行心肌灌注量化提供了一种床旁非电离替代核成像模态的方法。然而，其临床采用受到耗时的手动标注的限制。由于域内训练数据匮乏，自动分割已被证明具有挑战性。我们应用当前用于优化大数据集上大型语言模型的策略，将神经缩放定律应用于预测心肌分割的网络性能。我们在数据子集上外推性能，以确定CAMUS超声心动图数据集和25名患者的对比增强超声（CEUS）数据集上的最优网络大小。最后，通过将最终心肌灌注参数与资深心脏病专家获得的参数进行比较，验证了我们模型的临床实用性。基于缩放定律的外推能够预测完整数据集大小下的测试损失，使我们能够选择两个网络，在CAMUS上以240倍的参数减少获得最先进性能。我们观察到缩放定律的梯度从CAMUS迁移到CEUS数据集，但预测损失存在偏差。自动分割的掩膜在心肌灌注量化中与资深心脏病专家表现相当。这些结果确立了神经缩放定律作为小成像数据集上数据驱动计算最优模型设计的实用工具。

English abstract

Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.
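The extrapolation step described in this abstract can be illustrated with a small sketch. The abstract does not state the functional form of the scaling law, so this assumes a pure power law, loss ≈ a · size^(-b), fitted by least squares in log-log space. The function names and the synthetic subset sizes and losses are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by linear least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(size, a, b):
    """Evaluate the fitted power law at a new dataset size."""
    return a * size ** (-b)


# Synthetic, noise-free losses measured on data subsets (illustrative only).
subset_sizes = np.array([50.0, 100.0, 200.0, 400.0])
subset_losses = 2.0 * subset_sizes ** (-0.5)

a, b = fit_power_law(subset_sizes, subset_losses)
# Extrapolate to a hypothetical full dataset four times larger.
full_size_loss = predict_loss(1600.0, a, b)
```

On real measurements the fit would be noisy, and a form with an irreducible-loss offset may describe the data better; the sketch only shows the mechanics of fitting on subsets and extrapolating to the full dataset size.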

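2606.06540 2026-06-08 eess.IV cs.CV New submission

ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

ErA：用于单图像散焦去模糊的误差感知深度展开网络

Tu Vo, Chan Y. Park

Affiliations * KC Machine Learning Lab

AI summary: Proposes the ErA network, which jointly learns a compact kernel basis and per-pixel weights, and corrects kernel estimation errors through alternating updates of an error-aware term in an augmented Lagrangian unrolling together with a ResUNet denoiser, achieving state-of-the-art performance on multiple datasets.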

2606.06534 2026-06-08 eess.IV cs.AI New submission

Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

基于视觉基础模型的注意力一致纵向医学视觉问答

Jialin Wu, Qianru Zhang, Georges El Fakhri, Xiaofeng Liu

Affiliations * University of California, San Diego; Yale Biomedical Imaging Institute

AI summary: Proposes an attention-guided encoder-decoder framework that combines lightweight registration, adaptive mask generation, and auxiliary loss functions for longitudinal medical visual question answering on chest X-rays, achieving strong performance on the Medical-Diff-VQA benchmark.

Comments Accepted to CVPR 2026 Workshop PHAROS-AIF-MIH

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 6448-6458

AI Chinese abstract

纵向医学视觉问答（VQA）需要推理当前时间点图像与参考时间点图像之间的解剖差异。我们针对胸部X光片提出了一种注意力引导的编码器-解码器。与传统的直接对比不同，我们引入了一个轻量级仿射配准模块，通过小配准正则化将当前图像与参考图像进行共配准，以减少无关运动。配准后的图像对输入图像编码器，随后通过冻结的DINO掩码生成器和可训练的自适应掩码生成器生成应用于原始图像对的掩码。掩码图像对再次输入图像编码器，并与文本特征拼接，作为基于多模态Transformer的解码器的输入以生成最终答案。为了促进学习稳定并澄清变化信号，受DINO-v3启发，我们加入了额外的辅助目标，包括掩码重建损失、成对Gram风格一致性损失和KoLeo均匀性损失，以增强表示的几何结构。在Medical-Diff-VQA基准上，该模型获得了强大的BLEU、ROUGE-L、CIDEr和METEOR分数，同时通过共享显著性掩码提供了内在的可解释性。这些结果支持将显著性条件生成与轻度预对齐作为医学VQA中纵向推理的原则性框架。我们的训练策略也展示了在生物医学中利用图像基础模型的范式潜力：同时优化监督和无监督学习目标。

English abstract

Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time point and an image of a referred time point. We propose an attention-guided encoder-decoder for this task with chest X-rays. Instead of conventional direct contrast, we propose to include a lightweight affine registration module to reduce nuisance motion by co-registering the current image to the reference image with a small registration regularizer. The registered image pair is fed into the image encoder, followed by a frozen DINO-based mask generator and a trainable adaptive mask generator to produce masks applied to the original image pairs. The masked image pairs are again fed into the image encoder and concatenated with text features as the input to a multimodal transformer-based decoder to generate final answers. To facilitate learning stabilization and clarify the change signal, inspired by DINO-v3, we include additional auxiliary objectives, including a mask rebuilding loss, a pairwise Gram-style consistency loss, and a KoLeo uniformity loss, which enhances the geometry of the representation. On the Medical-Diff-VQA benchmark, the model delivers strong BLEU, ROUGE-L, CIDEr, and METEOR scores while offering intrinsic interpretability through the shared saliency mask. These results support saliency-conditioned generation with mild pre-alignment as a principled framework for longitudinal reasoning in medical VQA. Our training strategy also illustrates the potential of a paradigm in utilizing image foundation models in biomedicine: optimizing both supervised and unsupervised learning objectives simultaneously.

2606.06524 2026-06-08 eess.IV cs.CV cs.LG New submission

Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

基于物理引导深度学习的先进洪水预测：结合UNet、FNO与SAR/光学影像

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

Affiliations * National Center for Atmospheric Research (NCAR)

AI summary: Proposes a physics-guided deep learning framework that fuses multi-modal remote sensing with shallow water equation constraints and uses a hybrid UNet-FNO architecture for flood prediction, reaching an IoU of 0.82 and an F1 score of 0.90.

Comments This paper has been accepted for publication in the Proceedings of the IEEE Radar Conference (RadarConf 2026). The final authenticated version will be available through IEEE Xplore

AI Chinese abstract

由于地面观测有限、地形条件异质以及数据驱动模型中难以强制执行水动力学一致性，准确且可扩展的洪水测绘仍然具有挑战性。本文介绍了一种物理引导的深度学习框架，该框架集成了多模态遥感（Sentinel-1 SAR、Sentinel-2光学影像和DEM衍生的地形特征）与深度平均浅水方程（SWE）的约束。所提出的混合架构结合了用于捕捉精细尺度空间细节的UNet和用于模拟流域尺度水力相互作用的傅里叶神经算子（FNO），而物理信息残差损失确保了质量和动量一致性。在多种洪泛区环境下评估，混合模型在洪水范围预测中实现了0.82的交并比和0.90的F1分数，优于仅使用UNet和仅使用FNO的基线模型。以水动力学模拟作为参考数据，该模型在水深方面实现了0.21米的均方根误差，在流速方面实现了0.15米/秒的均方根误差。物理一致性得以保持，残差低且质量不平衡低于2.1%。消融研究证实，去除基于物理的正则化会显著降低性能，突显了物理约束对稳定性和泛化能力的价值。这些结果表明，将水动力学原理嵌入深度学习可产生更准确、可靠且物理一致的洪水预测，为业务监测和大规模部署提供了巨大潜力。

English abstract

Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.
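As a reading aid for the two flood-extent metrics quoted above, here is a minimal sketch of how Intersection over Union and F1 are commonly computed from binary masks. The function name and the toy masks are hypothetical, and the abstract does not say how the paper aggregates scores across scenes.

```python
import numpy as np


def iou_and_f1(pred, truth):
    """Return (IoU, F1) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()   # flooded in both masks
    fp = np.logical_and(pred, ~truth).sum()  # predicted flooded, actually dry
    fn = np.logical_and(~pred, truth).sum()  # predicted dry, actually flooded
    union = tp + fp + fn
    iou = tp / union if union else 1.0       # two empty masks count as a match
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return float(iou), float(f1)


# Toy 2x3 masks (illustrative values only): 2 hits, 1 false alarm, 1 miss.
pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
iou, f1 = iou_and_f1(pred, truth)
```

When both metrics come from the same confusion counts they are linked by F1 = 2·IoU / (1 + IoU), so an IoU of 0.82 corresponds to an F1 of about 0.90, which matches the pair reported in the abstract.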

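2606.06509 2026-06-08 eess.IV cs.AI cs.LG q-bio.TO New submission

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

在有限标签下哪些解剖结构重要？用于心脏病理预测的数据高效解剖感知基准

Himanshu Singh

Affiliations * Himanshu Singh

AI summary: For medical imaging under limited labels and compute, proposes an anatomy-aware benchmark that compares anatomical structure representations and classifiers, finding that representation quality matters more than model complexity.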

Comments ACCEPTED at ICML 2026 Workshop GlobalSouthML (Seoul, South Korea; PMLR 306, 2026)

2606.07399 2026-06-08 stat.ML cs.LG New submission

Automatic, Debiased, and Invariant Counterfactual Generation under General Interventions

通用干预下的自动、去偏和不变反事实生成

Raphael C Kim, Jingsen Zhu, Ramin Zabih, Michele Santacatterina

Affiliations * Cornell Tech; Cornell University; Department of Biostatistics, Department of Population Health; New York University Grossman School of Medicine

AI summary: Proposes the ADIGen framework, which combines Riesz regression, causal invariance, and orthogonal statistical learning to make counterfactual generation under general interventions automatic, debiased, and invariant, and provides excess risk bounds.

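2606.07016 2026-06-08 stat.AP cs.CV New submission

An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

信号交叉口弱势道路使用者安全的集成路边感知与通信框架

Parvez Anowar

Affiliations * Department of Civil, Environmental and Construction Engineering, University of Central Florida

AI summary: Proposes a framework integrating multi-modal sensing, edge computing, V2X/P2X communication, and adaptive signal control. An analysis of 53,319 annotations from the public R-LiViT dataset finds that VRUs account for about 49% of observations, that density differs substantially between day and night, that close-proximity event counts vary about 10-fold across locations, and that 83% of pedestrian bounding boxes are small, supporting multi-modal sensing and adaptive deployment.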

Comments 17 pages, 5 figures, 2 tables. Preprint

详情

AI中文摘要

弱势道路使用者（VRU）约占全球城市交通死亡人数的一半，而交叉口集中了不成比例的伤亡。最近关于VRU保护的感知技术综述列举了数十种单传感器和双传感器部署，但所调查的系统均未将多模态感知与边缘侧近碰撞分析以及双向车联万物（V2X）和行人联万物（P2X）消息传递集成在单个交叉口机柜中。本文提出一个信号交叉口VRU保护的综合框架，在感知层结合LiDAR、雷达、RGB相机和热成像相机，在计算层进行基于边缘的预测和替代安全分析，在通信层进行V2X和P2X消息传递，在驱动层进行自适应信号控制。该框架基于使用R-LiViT（首个公开的路边LiDAR-视觉-热成像数据集）的实证案例研究，该数据集提供了200个多模态序列和2,400个标注的RGB-T帧，来自三个德国交叉口。对53,319个检测标注的分析显示，VRU约占所有道路使用者观测的49%；从白天到夜晚，行人密度下降38%，车辆下降45%，而夜间分布显示更高的近距离比例；在三个交叉口的八个独特位置，每帧近距离事件计数变化约10倍；83%的行人边界框在图像空间中较小，表明VRU通常远离任何单个传感器。这些发现支持多模态感知、边缘侧分析和自适应上下文感知部署，而非统一的单传感器解决方案。

English abstract

Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.

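2606.06957 2026-06-08 stat.ML cs.LG New submission

Deep Single-Index Fréchet Regression

深度单指标弗雷歇回归

Muqing Cui, Yidong Zhou, Su I Iao, Hans-Georg Müller

Affiliations * University of California, Berkeley

AI summary: Proposes the DeSI framework, which estimates a single-index direction with a deep neural network and performs Fréchet regression in a metric space, mitigating the curse of dimensionality while retaining interpretability, with theoretical guarantees on convergence rates and strong performance on distributions, networks, and other data types.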

AI Chinese abstract

预测位于非欧几里得空间中的输出，如概率分布、网络和对称正定矩阵，在现代数据分析中变得越来越重要，特别是当输入是高维时。我们提出了DeSI（深度单指标弗雷歇回归），一种用于度量空间值输出和多变量输入的半参数回归框架，该框架假设条件弗雷歇均值具有单指标结构。DeSI使用深度神经网络估计可解释的指标方向，该方向量化了输入的相对重要性，并在目标度量空间中沿着得到的一维指标进行弗雷歇回归。这种结构缓解了维数灾难，同时保持了可解释性，这与标准深度神经网络形成对比。我们为DeSI建立了理论保证，包括一致逼近和收敛速度，并通过在分布、网络和对称正定矩阵上的模拟，以及在新泽西州的成分情绪数据上的应用，展示了其强大的预测性能。

English abstract

Predicting outputs that are located in non-Euclidean spaces, such as probability distributions, networks, and symmetric positive-definite matrices, is becoming increasingly important in modern data analysis, particularly when inputs are high-dimensional. We propose DeSI (Deep Single-Index Fréchet Regression), a semiparametric framework for regression with metric space-valued outputs and multivariate inputs that assumes a single-index structure for the conditional Fréchet mean. DeSI estimates an interpretable index direction, which quantifies the relative importance of inputs, using a deep neural network, and performs Fréchet regression along the resulting one-dimensional index in the target metric space. This structure mitigates the curse of dimensionality while retaining interpretability, which stands in contrast to standard deep neural networks. We establish theoretical guarantees for DeSI, including uniform approximation and convergence rates, and demonstrate its strong predictive performance through simulations on distributions, networks, and symmetric positive-definite matrices, as well as an application to compositional mood data from New Jersey.

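2606.06855 2026-06-08 stat.ML cs.LG math.ST stat.TH New submission

Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite $L_p$ Moments

超越有界差分的稳定性：有限 $L_p$ 矩下的尖锐泛化界

Qianqian Lei, Soham Bonnerjee, Yuefeng Han, Wei Biao Wu

Affiliations * University of California, Berkeley

AI summary: For heavy-tailed or unbounded losses, proposes a stability framework that requires only a finite $L_p$ moment condition and derives sharp high-probability generalization bounds covering empirical risk minimization, transductive regression, and meta-learning.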

AI Chinese abstract

虽然算法稳定性是理解学习算法泛化能力的核心工具，但现有的高概率保证通常依赖于一致有界或次高斯/次韦布尔尾部假设，这对于现代设置中重尾或无界损失可能过于严格。我们开发了一个仅需有限 $L_p$ 矩条件的稳定性框架。我们的第一个贡献是在 $L_p$ 约束下独立随机变量函数的尖锐集中不等式，将 McDiarmid 的有界差分技术扩展到经典范围之外。利用这些结果，我们在一系列学习范式中推导出尖锐的高概率泛化界，包括经验风险最小化、转导回归和元学习。这些保证表明，即使有界性不成立，$L_p$ 稳定性也足以实现鲁棒泛化，显著削弱了稳定性文献中的标准假设。

English abstract

While algorithmic stability is a central tool for understanding generalization of learning algorithms, existing high-probability guarantees typically rely on uniform boundedness or sub-Gaussian/sub-Weibull tail assumptions, which can be overly restrictive for modern settings with heavy-tailed or unbounded losses. We develop a stability-based framework that requires only a finite $L_p$ moment condition. Our first contribution is sharp concentration inequalities for functions of independent random variables under $L_p$ constraints, extending McDiarmid's bounded-differences techniques beyond the classical regime. Leveraging these results, we derive sharp high-probability generalization bounds across a range of learning paradigms, including empirical risk minimization, transductive regression, and meta-learning. These guarantees show that $L_p$ stability suffices for robust generalization even when boundedness fails, substantially weakening the standard assumptions in the stability literature.
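For context, the classical bounded-differences result that this abstract says it extends can be stated as follows. This is the standard textbook form of McDiarmid's inequality, not a result from the paper; the symbols $f$, $c_i$, and $t$ are notation chosen here. If $X_1,\dots,X_n$ are independent and changing the $i$-th argument of $f$ alters its value by at most $c_i$, then $f(X_1,\dots,X_n)$ concentrates around its mean:

```latex
\[
  \sup_{x_1,\dots,x_n,\,x_i'}
  \bigl| f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n) \bigr| \le c_i
  \qquad (i = 1,\dots,n),
\]
\[
  \Pr\bigl( f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \ge t \bigr)
  \le \exp\!\left( -\frac{2t^2}{\sum_{i=1}^{n} c_i^2} \right),
  \qquad t > 0 .
\]
```

The uniform bound $c_i$ in the first display is the assumption that fails for heavy-tailed or unbounded losses, which is the setting the abstract targets with finite $L_p$ moment conditions.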

2606.06814 2026-06-08 stat.ML cs.LG math.ST stat.AP stat.TH New submission

The Effect of Training Task Diversity on In-Context Learning through the Lens of Low-Dimensional Subspaces

训练任务多样性对上下文学习的影响：基于低维子空间的视角

Soo Min Kwon, Alec S. Xu, Can Yaras, Dogyoon Song, Laura Balzano, Qing Qu

Affiliations * University of California, Berkeley; University of Washington; University of California, Los Angeles; Stanford University; University of Toronto

AI summary: Using a low-rank Gaussian mixture model, analyzes how training task diversity (defined by the number of non-overlapping columns between subspaces) improves the generalization and optimization of in-context learning with linear attention, explains why diversity shortens the learning plateau and yields out-of-distribution generalization, and extends the results to nonlinear settings.

AI Chinese abstract

Transformer执行上下文学习（ICL）的涌现能力引发了大量旨在理解其底层机制的研究。现有工作通常研究训练任务多样性（定义为ICL训练任务向量的数量或任务向量所来自的函数类数量）如何塑造ICL的学习动态和泛化能力。尽管这两种定义都揭示了许多有趣的现象，但后一定义下的许多观察结果在理论上仍未得到解释。本文提出了一个最小分析模型，在这些现象下，这些现象可以从训练数据的属性中可靠地涌现。通过将训练任务向量建模为低秩高斯的混合，我们展示了训练任务多样性（由参数化协方差矩阵的子空间之间的非重叠列数定义）如何改善线性注意力ICL的泛化和优化轨迹。特别地，我们表明我们的模型可以解释（i）为什么任务多样性训练缩短了ICL的平台期，以及（ii）为什么ICL似乎实现了分布外泛化。最后，我们通过实验证明了我们的结果如何扩展到非线性Transformer和非线性函数类。总体而言，我们的工作提出了一个可处理的框架来统一现有的观察结果。

English abstract

The transformer's emergent ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its underlying mechanisms. Existing works often study how training task diversity, defined either as the number of ICL training task vectors or as the number of function classes from which the task vectors are drawn, shapes both the learning dynamics and generalization capabilities of ICL. While both definitions have uncovered many interesting phenomena, many observations under the latter definition remain theoretically unexplained. This paper presents a minimal analytical model under which these phenomena provably emerge from the properties of the training data. By modeling the training task vectors as a mixture of low-rank Gaussians, we show how training task diversity, defined by the number of non-overlapping columns between subspaces that parameterize the covariance matrices, improves both the generalization and optimization trajectory of ICL with linear attention. In particular, we show that our model can explain (i) why training with task diversity shortens the ICL plateau and (ii) why ICL appears to achieve out-of-distribution generalization. We conclude by empirically demonstrating how our results extend to nonlinear transformers and nonlinear function classes. Overall, our work presents a tractable framework to unify existing observations.

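2606.06785 2026-06-08 stat.ML cs.LG math.DS New submission

Empirical Transfer Operators and Finite-Sample Change Detection for Noisy Expanding Interval Maps

经验转移算子与含噪扩张区间映射的有限样本变化检测

Aparna Rajput

Affiliations * Department of Mathematics and Statistics, Concordia University

AI summary: For one-dimensional noisy dynamical systems, proposes a finite-sample change detection method based on partition-based empirical transition matrices, which detects changes in the invariant density by comparing the L1 distance between the stationary distributions of a sliding window and a baseline segment, and gives finite-sample bounds and false-alarm guarantees.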

Comments 27 pages, 2 tables, 1 figure

AI Chinese abstract

我们研究了一维含噪动力系统的有限样本变化检测，使用基于分区的经验近似来刻画平稳行为。给定区间值过程的观测，我们对状态空间进行划分，从观测到的分区元素之间的转移中估计一个有限转移矩阵，并应用一个小的Doeblin型正则化以确保唯一的平稳分布。从初始参考段，我们计算基线经验平稳分布$\widehat{\pi}_{0,\rho}$。对于每个后续滑动窗口，我们计算$\widehat{\pi}_{t,\rho}$并定义得分\[ S_t=\|\widehat{\pi}_{t,\rho}-\widehat{\pi}_{0,\rho}\|_1. \] $S_t$的大值表示相对于基线的平稳行为发生变化。该统计量检测不变密度或平稳定律的变化，但不检测转移动态的所有可能变化。在关于经验转移集中性、有限状态平稳分布稳定性、分区近似、正则化偏差和噪声稳定性的明确假设下，我们推导了经验平稳密度的有限样本界。该界将采样误差、正则化偏差、分区近似误差和噪声偏差分开。然后，我们得到了单窗口误报保证，以及当不变密度变化超过估计误差时的充分检测条件。我们在合成含噪beta映射变点实验中展示了该方法。

English abstract

We study finite-sample change detection for one-dimensional noisy dynamical systems using partition-based empirical approximations of stationary behaviour. Given observations from an interval-valued process, we partition the state space, estimate a finite transition matrix from observed transitions between partition elements, and apply a small Doeblin-type regularisation to ensure a unique stationary distribution. From an initial reference segment, we compute a baseline empirical stationary distribution $\widehat{\pi}_{0,\rho}$. For each later sliding window, we compute $\widehat{\pi}_{t,\rho}$ and define the score \[ S_t=\|\widehat{\pi}_{t,\rho}-\widehat{\pi}_{0,\rho}\|_1. \] Large values of $S_t$ indicate a change in stationary behaviour relative to the baseline. The statistic detects changes in invariant density or stationary law, but not all possible changes in transition dynamics. Under explicit assumptions on empirical transition concentration, finite-state stationary distribution stability, partition approximation, regularisation bias, and noise stability, we derive a finite-sample bound for the empirical stationary density. The bound separates sampling error, regularisation bias, partition approximation error, and noise bias. We then obtain a single-window false-alarm guarantee and a sufficient detection condition when the invariant density changes by more than the estimation error. We illustrate the method on synthetic noisy beta-map change-point experiments.
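The pipeline in this abstract (partition, empirical transition matrix, regularisation, stationary distribution, L1 score) can be sketched roughly as below. Several details are assumptions made for illustration rather than the paper's choices: equal-width bins on [0, 1], mixing with the uniform distribution as the Doeblin-type regularisation, power iteration for the stationary distribution, and the simulated noisy beta maps with their parameter values. All function names are hypothetical.

```python
import numpy as np


def empirical_transition_matrix(x, n_bins):
    """Row-normalised transition counts between equal-width bins of [0, 1]."""
    bins = np.minimum((np.asarray(x) * n_bins).astype(int), n_bins - 1)
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (bins[:-1], bins[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Bins that were never visited fall back to a uniform row.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / n_bins)


def regularised_stationary(P, rho):
    """Stationary distribution of (1 - rho) * P + rho * uniform (assumed form)."""
    n = P.shape[0]
    P_rho = (1.0 - rho) * P + rho / n
    pi = np.full(n, 1.0 / n)
    for _ in range(10_000):  # power iteration; the mixing step guarantees contraction
        nxt = pi @ P_rho
        converged = np.abs(nxt - pi).sum() < 1e-12
        pi = nxt
        if converged:
            break
    return pi


def change_score(baseline, window, n_bins=10, rho=0.05):
    """L1 distance between the regularised stationary distributions."""
    pi0 = regularised_stationary(empirical_transition_matrix(baseline, n_bins), rho)
    pit = regularised_stationary(empirical_transition_matrix(window, n_bins), rho)
    return float(np.abs(pit - pi0).sum())


def simulate(beta, n, rng, noise=0.01):
    """Noisy beta map x -> (beta * x + noise) mod 1 (illustrative dynamics)."""
    x = np.empty(n)
    x[0] = rng.uniform()
    for i in range(1, n):
        x[i] = (beta * x[i - 1] + noise * rng.normal()) % 1.0
    return x


rng = np.random.default_rng(0)
baseline = simulate(2.0, 5000, rng)
same = simulate(2.0, 5000, rng)      # same dynamics as the baseline
changed = simulate(1.5, 5000, rng)   # different slope, different invariant density
score_same = change_score(baseline, same)
score_changed = change_score(baseline, changed)
```

In this toy setup the score for the window with a different slope should exceed the score for the window drawn from the baseline dynamics; choosing a threshold that controls false alarms is what the finite-sample bounds in the abstract address, and it is not attempted here.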

2606.06772 2026-06-08 stat.ML cs.AI cs.LG New submission

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

深度神经网络的泛化：梯度方法的极小化最优速率

Junyu Zhou, Puyu Wang, Yunwen Lei, Marius Kloft, Yiming Ying

Affiliations * Mathematical Institute for Machine Learning and Data Science, Catholic University of Eichstätt-Ingolstadt; Department of Computer Science, RPTU Kaiserslautern-Landau; Department of Mathematics, The University of Hong Kong; School of Mathematics and Statistics, The University of Sydney

AI summary: Establishes a connection between the learning dynamics of over-parameterized deep neural networks and those of kernel methods, and proves that gradient descent and stochastic gradient descent attain minimax-optimal generalization error given sufficient width.

Comments 37 pages

AI Chinese abstract

理解过参数化神经网络的泛化性能已成为深度学习理论的核心课题。尽管近期进展，特别是神经正切核（NTK）机制下的工作，揭示了浅层架构的行为，但深度神经网络（DNN）的统计泛化性质，尤其是在回归任务中，仍远未得到充分理解。本文通过提供使用梯度方法训练的DNN的全面泛化分析，在弥合这一差距方面取得了重大进展。首先，我们首次建立了使用梯度方法训练的、具有光滑激活函数的DNN的学习动态与核方法的学习动态之间的关键联系，表明过参数化DNN上的梯度方法可以完全继承其核对应物的有利学习动态。基于这一联系以及核方法已确立的最优性，我们推导出了梯度下降（GD）和随机梯度下降（SGD）的过量总体风险的第一个已知极小化最优速率，假设网络宽度与样本大小成多项式比例。我们的结果表明，在足够宽度下，由GD或SGD训练的DNN可以实现与基于核的方法相当的泛化性能。

English abstract

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

2606.06764 2026-06-08 stat.ML cs.AI cs.LG New submission

Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

Junyu Zhou, Puyu Wang, Yunwen Lei, Yiming Ying, Ding-Xuan Zhou

Affiliations: Mathematical Institute for Machine Learning and Data Science, KU Eichstätt-Ingolstadt; Department of Computer Science, RPTU Kaiserslautern-Landau; Department of Mathematics, University of Hong Kong; School of Mathematics and Statistics, University of Sydney

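AI summary: For deep ReLU networks in the Neural Tangent Kernel (NTK) regime, this paper establishes the first minimax-optimal generalization error rates for gradient descent (GD) and stochastic gradient descent (SGD), showing that with sufficient width they attain the optimal rates of kernel methods.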

Comments 39 pages, 1 table

Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data

SINDy-RL: Interpretable and Efficient Model-Based Reinforcement Learning

Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

A Unified Framework to Enforce, Discover, and Promote Symmetry in Machine Learning

3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

Amortized Neural Optimization for Pre-Layout Signal Integrity Design Space Exploration using Differentiable Surrogates

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Beyond Backscatter: InSAR coherence from detected SAR images

Assessing True Generalisability of Audio-Visual Speech Recognisers

Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

DaX: Learning General Pathology Representations Across Scales

SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models

Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails

BiEAR: A Human Auditory-Inspired Adaptive Binaural Front-end for Multi-Speaker Localisation and Distance Estimation

Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models

Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

Automatic, Debiased, and Invariant Counterfactual Generation under General Interventions

An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

Deep Single-Index Fréchet Regression

Stability beyond Bounded Differences: Sharp Generalization Bounds under Finite $L_p$ Moments

The Effect of Training Task Diversity on In-Context Learning through the Lens of Low-Dimensional Subspaces

Empirical Transfer Operators and Finite-Sample Change Detection for Noisy Expanding Interval Maps

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks