arXivDaily: daily arXiv digest, updated Monday through Friday
2508.05852 2026-06-03 cs.CV

Interpretable Modeling of Driver Attention Shifts with a Vision-Language Model

Kaiser Hamid, Khandakar Ashrafi Akbar, Peihang Li, Nade Liang

Affiliations: Texas Tech University; Towson University

AI summary: This study fine-tunes a vision-language model with a small amount of human supervision to generate interpretable descriptions of driver attention shifts, complementing conventional gaze heatmaps and supporting human-factors analysis, driver monitoring, and situation awareness.

Abstract

Driver gaze is commonly modeled as a spatial heatmap, but heatmaps alone are difficult for humans to interpret because they do not explain which road object or region is being monitored or why an attention shift may matter. This study examines whether minimal human-grounded supervision can steer a vision-language model toward interpretable descriptions of driver attention shifts. Using selected high-change gaze moments from the Berkeley DeepDrive-Attention dataset, we compare zero-shot, one-shot, and LoRA fine-tuned VLM conditions against human-refined reference descriptions and expert ratings. Results show that fine-tuning with 80 expert-refined attention examples improves ROUGE-L, METEOR, Entity Alignment F1, and Human Alignment Score relative to unsteered VLM outputs. The findings suggest that language-based descriptions can complement gaze heatmaps by making driver attention more accessible for human-factors analysis, driver-monitoring review, and situation-awareness support.
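
As a rough illustration of the ROUGE-L metric named in the abstract, the sketch below scores a generated description against a reference using the longest common subsequence of their tokens. It is not the authors' evaluation code: the whitespace tokenization, the absence of stemming, and the two example sentences are assumptions made for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercase whitespace tokens (no stemming; an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical descriptions, invented for illustration only.
reference = "driver shifts gaze to the pedestrian at the crosswalk"
candidate = "driver shifts gaze to a pedestrian near the crosswalk"
score = rouge_l_f1(candidate, reference)
```

Here seven of the nine tokens on each side fall on a common subsequence, so precision, recall, and F1 all equal 7/9.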

2508.03098 2026-06-03 cs.CL

Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu

Affiliations: Emory University; Illinois Institute of Technology

AI summary: Proposes PAD, a lightweight inference-time defense that injects calibrated Gaussian noise during decoding and combines confidence-based screening, sensitivity estimation, and context-aware noise calibration to balance privacy protection and generation quality in RAG systems, with a Rényi differential privacy accountant tracking cumulative privacy loss.

Abstract

Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response $(\varepsilon, \delta)$-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available at https://github.com/wang2226/PAD.
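
To make the decoding-time mechanism more concrete, the sketch below adds Gaussian noise to logits only at screened steps and tracks the privacy cost with the standard Rényi accounting for the Gaussian mechanism. It is a toy under stated assumptions, not the PAD implementation: the screening rule (treating steps with a very confident top token as high-risk), the fixed sensitivity, and all numeric values are hypothetical. The function names are likewise invented for this illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_rdp(alpha, sensitivity, sigma):
    """RDP cost of order alpha for one Gaussian-mechanism release."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def rdp_to_dp(rdp_total, alpha, delta):
    """Standard conversion from (alpha, rdp_total)-RDP to (epsilon, delta)-DP."""
    return rdp_total + math.log(1.0 / delta) / (alpha - 1.0)


def protect_logits(logits, sigma, conf_threshold, rng):
    """Add noise only when the top token probability reaches the threshold.

    Treating confident steps as high-risk is an assumption of this sketch.
    """
    if max(softmax(logits)) < conf_threshold:
        return list(logits), False
    return [x + rng.gauss(0.0, sigma) for x in logits], True


rng = random.Random(0)
alpha, sigma, sensitivity, delta = 8.0, 2.0, 1.0, 1e-5  # hypothetical values
steps = [[0.1, 0.0, -0.1], [9.0, 0.5, 0.2], [7.5, 0.1, 0.0]]  # toy logits
rdp_total = 0.0
for logits in steps:
    _, noised = protect_logits(logits, sigma, conf_threshold=0.9, rng=rng)
    if noised:  # the accounting in this sketch covers only the protected steps
        rdp_total += gaussian_rdp(alpha, sensitivity, sigma)
epsilon = rdp_to_dp(rdp_total, alpha, delta)
```

RDP costs add across steps, which is why a running sum suffices before the final conversion to an $(\varepsilon, \delta)$ pair. In this toy run two of the three steps are screened as high-risk, giving a total RDP cost of 2.0 at order 8.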

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

Affiliations: School of Computing Science, Simon Fraser University

AI summary: Presents the CoMPAS3D dataset and evaluation framework, which use objective metrics such as move legibility and proficiency appropriateness to address the lack of social-context evaluation in interactive motion generation.

Comments https://rosielab.github.io/compas3d

Abstract

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

2504.01531 2026-06-03 cs.LG

DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Xiaobei Zou, Luolin Xiong, Kexuan Zhang, Cesare Alippi, Yang Tang

Affiliations: Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai; Faculty of Informatics, Università della Svizzera italiana; Department of Electronics, Information and Bioengineering, Politecnico di Milano

AI summary: To address forecasting in non-stationary spatio-temporal systems, proposes the Distribution and Relation Adaptive Network (DRAN), which uses a Spatial Factor Learner (SFL) and a Dynamic-Static Fusion Learner (DSFL) to adapt to distribution shifts and relation changes respectively, and outperforms existing methods on weather and traffic forecasting tasks.

Comments 15 pages, 10 figures

Abstract

Accurate predictions of spatio-temporal systems are crucial for tasks such as system management, control, and crisis prevention. However, the inherent time variance of many spatio-temporal systems poses challenges to achieving accurate predictions whenever stationarity is not granted. In order to address non-stationarity, we propose a Distribution and Relation Adaptive Network (DRAN) capable of dynamically adapting to relation and distribution changes over time. While temporal normalization and de-normalization are frequently used techniques to adapt to distribution shifts, this operation is not suitable for the spatio-temporal context as temporal normalization scales the time series of nodes and possibly disrupts the spatial relations among nodes. In order to address this problem, a Spatial Factor Learner (SFL) module is developed that enables the normalization and de-normalization process. To adapt to dynamic changes in spatial relationships among sensors, we propose a Dynamic-Static Fusion Learner (DSFL) module that effectively integrates features learned from both dynamic and static relations through an adaptive fusion ratio mechanism. Furthermore, we introduce a Stochastic Learner to capture the noisy components of spatio-temporal representations. Our approach outperforms state-of-the-art methods on weather prediction and traffic flow forecasting tasks. Experimental results show that our SFL efficiently preserves spatial relationships across various temporal normalization operations. Visualizations of the learned dynamic and static relations demonstrate that DSFL can capture both local and distant relationships between nodes.
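
The problem the abstract attributes to temporal normalization can be seen in a few lines. The sketch below applies plain per-node normalization and de-normalization to two toy sensors and shows that the relative scale between them disappears from the normalized series. It does not implement the SFL module; the function names and sensor values are hypothetical.

```python
from statistics import fmean, pstdev


def normalize(series, eps=1e-8):
    """Per-node temporal normalization; returns the scaled series and its stats."""
    mean, std = fmean(series), pstdev(series)
    return [(x - mean) / (std + eps) for x in series], (mean, std)


def denormalize(series, stats, eps=1e-8):
    """Invert normalize() using the stored per-node statistics."""
    mean, std = stats
    return [x * (std + eps) + mean for x in series]


# Two hypothetical sensors: node_b reads ten times larger values than node_a.
node_a = [1.0, 2.0, 3.0, 4.0]
node_b = [10.0, 20.0, 30.0, 40.0]

norm_a, stats_a = normalize(node_a)
norm_b, stats_b = normalize(node_b)

# After normalization the two series are (almost) identical, so the 10x scale
# relation between the nodes survives only in the stored statistics.
max_gap = max(abs(a - b) for a, b in zip(norm_a, norm_b))
restored_b = denormalize(norm_b, stats_b)
```

A model that sees only `norm_a` and `norm_b` cannot tell the two sensors apart by magnitude, which is the kind of spatial information the abstract says is at risk.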

2506.21129 2026-06-03 cs.LG cs.AI

Curriculum-Adapted Robust Reinforcement Learning for UAV Deconfliction in Adversarial Environments

Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo

Affiliations: Faculty of Engineering and Applied Sciences, Cranfield University

AI summary: Proposes a curriculum-guided adaptation framework that progressively exposes the policy to gradient-based adversarial observation perturbations while aligning temporal-difference error distributions, improving UAV robustness and generalization under GNSS spoofing attacks.

Abstract

Autonomous unmanned aerial vehicles (UAVs) increasingly rely on reinforcement learning (RL) for navigation. However, global navigation satellite system (GNSS) spoofing attacks can induce out-of-distribution observation shifts that corrupt value estimation and degrade mission performance. Existing robust RL approaches typically improve resilience against specific attack models but often fail to generalize to attacks not encountered during training. To address this limitation, we propose a curriculum-guided adaptation framework that progressively exposes a robust policy to gradient-based adversarial observation perturbations of increasing intensity while aligning temporal-difference (TD) error distributions across curriculum stages. Rather than adapting to a particular attack model, the proposed approach preserves TD-error consistency to promote transferability across attack conditions. We further derive a TD-space generalization certificate showing that if the TD-error distribution induced by a test-time attack remains sufficiently close to that of the final curriculum stage, the resulting performance degradation is bounded. The framework is evaluated in a UAV deconfliction environment with dynamic 3D obstacles under previously unseen fixed and dynamic GNSS spoofing attacks. Under fixed spoofing conditions, the curriculum-adapted policy achieved near-perfect mission success rates, compared with 20-56% for standard and robust RL baselines. Under dynamic obstacle-luring spoofing attacks, it achieved the highest episodic rewards while reducing mission completion steps by up to 45% across increasing aerial traffic densities.
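
One way to picture "aligning TD-error distributions across curriculum stages" is to compute one-step TD errors under two conditions and measure how far apart the two samples are. The sketch below does this with a tabular value function and a 1-D Wasserstein distance. The abstract does not say which divergence the authors use, so the Wasserstein choice is an assumption, as are the toy states, rewards, and discount factor.

```python
def td_errors(transitions, value, gamma=0.99):
    """One-step TD errors r + gamma * V(s') - V(s) for (s, r, s', done) tuples."""
    errors = []
    for state, reward, next_state, done in transitions:
        bootstrap = 0.0 if done else gamma * value[next_state]
        errors.append(reward + bootstrap - value[state])
    return errors


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical value table and transitions for a clean and a perturbed stage.
value = {"s0": 1.0, "s1": 2.0, "s2": 0.0}
clean_stage = [("s0", 0.0, "s1", False), ("s1", 1.0, "s2", True)]
attacked_stage = [("s0", -0.5, "s1", False), ("s1", 0.5, "s2", True)]

clean_td = td_errors(clean_stage, value, gamma=0.5)
attacked_td = td_errors(attacked_stage, value, gamma=0.5)
gap = wasserstein_1d(clean_td, attacked_td)
```

A training loop could penalize `gap` between consecutive stages; the certificate described in the abstract concerns the analogous gap between a test-time attack and the final stage.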

2506.04367 2026-06-03 cs.CV

Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan

Affiliations: Systems and Software Lab (SSL), Department of CSE, Islamic University of Technology (IUT)

AI summary: This study fine-tunes three video transformer models (VideoMAE, ViViT, and TimeSformer) on the BdSLW60 and BdSLW401 datasets for high-accuracy Bangla Sign Language recognition; VideoMAE reaches 95.5% accuracy on the frame-rate-corrected BdSLW60.

Comments 16 pages, 8 figures, 6 tables

Journal ref PLOS ONE, Vol. 21, No. 5, e0341909, 2026

Abstract

Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.
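
The stratified cross-validation step mentioned in the abstract can be sketched without any library. The snippet below deals the sample indices of each class round-robin into k folds so that every fold keeps the class balance. It is a minimal illustration, not the authors' pipeline: the labels are hypothetical, and the signer-independent test split on held-out users is a separate step not shown here.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, balancing every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal round-robin within a class
    return folds


# Hypothetical toy labels: 3 sign classes with 20 clips each.
labels = [sign for sign in ("hello", "thanks", "water") for _ in range(20)]
folds = stratified_folds(labels, k=10)
for held_out, val_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be fine-tuned on train_idx and scored on val_idx here.
    fold_counts = Counter(labels[i] for i in val_idx)
```

With 20 clips per class and 10 folds, each validation fold holds exactly two clips of every sign.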

2506.03087 2026-06-03 cs.LG cs.AI

Do Explanations Increase the Risk of Decision Logic Leakage? Explanation-Guided Stealing of Graph Models

Bin Ma, Yuyuan Feng, Minhua Lin, Enyan Dai

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Xiamen University; The Pennsylvania State University

AI summary: Studies the risk that explanation mechanisms leak the decision logic of graph neural networks and proposes a model-stealing framework that combines explanation alignment with data augmentation; experiments show it outperforms conventional methods.

Abstract

Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to a growing demand for model transparency. Recent advances in explainable GNNs have addressed this need by revealing important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose these models to security risks. This paper investigates how such explanations potentially leak critical decision logic that can be exploited for model stealing. We propose EGSteal, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach shows advantages over conventional methods in model stealing. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.

2506.00431 2026-06-03 cs.LG

TIDFormer: Exploiting Temporal and Interactive Dynamics Makes A Great Dynamic Graph Transformer

Jie Peng, Zhewei Wei, Yuhang Ye

Affiliations: Renmin University of China; Huawei, Shenzhen, Guangdong, China

AI summary: Proposes TIDFormer, which efficiently exploits temporal and interactive dynamics and designs an interpretable self-attention mechanism, outperforming existing models on several dynamic graph datasets.

Comments KDD2025

Abstract

Due to the proficiency of self-attention mechanisms (SAMs) in capturing dependencies in sequence modeling, several existing dynamic graph neural networks (DGNNs) utilize Transformer architectures with various encoding designs to capture sequential evolutions of dynamic graphs. However, the effectiveness and efficiency of these Transformer-based DGNNs vary significantly, highlighting the importance of properly defining the SAM on dynamic graphs and comprehensively encoding temporal and interactive dynamics without extra complex modules. In this work, we propose TIDFormer, a dynamic graph TransFormer that fully exploits Temporal and Interactive Dynamics in an efficient manner. We clarify and verify the interpretability of our proposed SAM, addressing the open problem of its uninterpretable definitions on dynamic graphs in previous works. To model the temporal and interactive dynamics, respectively, we utilize the calendar-based time partitioning information and extract informative interaction embeddings for both bipartite and non-bipartite graphs using merely the sampled first-order neighbors. In addition, we jointly model temporal and interactive features by capturing potential changes in historical interaction patterns through a simple decomposition. We conduct extensive experiments on several dynamic graph datasets to verify the effectiveness and efficiency of TIDFormer. The experimental results demonstrate that TIDFormer excels, outperforming state-of-the-art models across most datasets and experimental settings. Furthermore, TIDFormer exhibits significant efficiency advantages compared to previous Transformer-based methods.
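
As a small illustration of what "calendar-based time partitioning information" could look like, the sketch below maps an interaction timestamp to coarse calendar buckets that a model might embed alongside each edge. The abstract does not state which partitions TIDFormer uses, so the choice of month, day of week, and hour is an assumption, and `calendar_features` is a hypothetical helper.

```python
from datetime import datetime, timezone


def calendar_features(timestamp):
    """Map a Unix timestamp (seconds, UTC) to coarse calendar partitions."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return {
        "month": moment.month,            # 1..12
        "day_of_week": moment.weekday(),  # 0 = Monday
        "hour": moment.hour,              # 0..23
    }


features = calendar_features(0)  # 1970-01-01 00:00:00 UTC, a Thursday
```

Unlike a raw timestamp, these buckets repeat, so interactions that occur at the same hour on different days share a feature value.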

2505.20853 2026-06-03 cs.LG cs.AI

Cooperation of Experts: Fusing Heterogeneous Information with Large Margin

Shuo Wang, Shunyang Huang, Jinghui Yuan, Zhixiang Shen, Zhao Kang

Affiliations: unknown

AI summary: Proposes the Cooperation of Experts framework, which fuses heterogeneous information through a large margin mechanism and encodes multi-typed data into unified heterogeneous multiplex networks for robust, complementary knowledge extraction.

Comments Accepted at the 42nd International Conference on Machine Learning (ICML 2025)

Journal ref Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63169-63185, 2025

Abstract

Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By overcoming modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework's feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability. Our code is available at https://github.com/strangeAlan/CoE.

2505.20142 2026-06-03 cs.LG

Grounding Functional Similarity by Invariance-Aware Model Stitching

Ioannis Athanasiadis, Anmar Karmush, Michael Felsberg

Affiliations: unknown

AI summary: Standard model stitching ignores invariances and can therefore misjudge functional similarity; this work proposes invariance-aware model stitching under a forward-backward compatibility requirement, revealing hidden functional discrepancies.

Abstract

In deep learning, functional similarity evaluation quantifies the extent to which independently trained models learn similar input-output relationships. In model stitching, functional similarity is framed as representation forward compatibility, i.e., whether the representations of two models can be aligned to solve a given task. Recent studies, however, highlight a critical limitation: models relying on different information cues can still produce compatible representations, making them appear misleadingly similar (Smith et al., 2025). We attribute this failure to standard model stitching being inherently blind to the invariance properties of the stitched models. To address this limitation, we introduce the forward-backward compatibility requirement under which we formulate the invariance-aware model stitching. Through analyzing key stitching configurations, we study the interplay between forward and backward compatibility, showing that invariance-aware model stitching provides a more principled approach to functional similarity evaluation while revealing functional discrepancies previously obscured.

2406.18544 2026-06-03 cs.CV cs.GR

GS-ROR$^2$: Bidirectional-guided 3DGS and SDF for Reflective Object Relighting and Reconstruction

Zuo-Liang Zhu, Beibei Wang, Jian Yang

Affiliations: VCIP, College of Computer Science, Nankai University; School of Intelligence Science and Technology, Nanjing University

AI summary: Proposes a bidirectional guidance framework in which SDF-aided Gaussian splatting optimizes the relighting model and GS-guided SDF enhancement yields high-quality geometry reconstruction, addressing geometric constraints and detail capture in relighting and reconstructing reflective objects.

Comments Accepted by ACM TOG

Abstract

3D Gaussian Splatting (3DGS) has shown a powerful capability for novel view synthesis due to its detailed expressive ability and highly efficient rendering speed. Unfortunately, creating relightable 3D assets and reconstructing faithful geometry with 3DGS is still problematic, particularly for reflective objects, as its discontinuous representation raises difficulties in constraining geometries. Volumetric signed distance field (SDF) methods provide robust geometry reconstruction, but their expensive ray marching hinders real-time application and slows training. Besides, these methods struggle to capture sharp geometric details. To this end, we propose to guide 3DGS and SDF bidirectionally in a complementary manner, including an SDF-aided Gaussian splatting for efficient optimization of the relighting model and a GS-guided SDF enhancement for high-quality geometry reconstruction. At the core of our SDF-aided Gaussian splatting is the mutual supervision of the depth and normal between blended Gaussians and SDF, which avoids the expensive volume rendering of SDF. Thanks to this mutual supervision, the learned blended Gaussians are well-constrained with a minimal time cost. As the Gaussians are rendered in a deferred shading mode, the alpha-blended Gaussians are smooth, while individual Gaussians may still be outliers, yielding floater artifacts. Therefore, we introduce an SDF-aware pruning strategy to remove Gaussian outliers located far from the surface defined by the SDF, avoiding the floater issue. This way, our GS framework provides reasonable normals and achieves realistic relighting, while the mesh from depth is still problematic. Therefore, we design a GS-guided SDF refinement, which utilizes the blended normal from Gaussians to finetune the SDF. With this enhancement, our method can further provide high-quality meshes for reflective objects at the cost of 17% extra training time.
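
The SDF-aware pruning idea can be illustrated with an analytic SDF. The sketch below keeps only Gaussian centres whose absolute signed distance to the surface is under a threshold. It is a toy under stated assumptions: the paper's SDF is learned, whereas a unit sphere stands in for it here, and the threshold and centre coordinates are hypothetical.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Signed distance from a point to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def prune_outliers(centers, sdf, max_distance):
    """Keep only Gaussian centres whose |SDF| is within max_distance."""
    return [c for c in centers if abs(sdf(c)) <= max_distance]


# Hypothetical Gaussian centres: two near the surface, one floater far away.
centers = [(1.0, 0.0, 0.0), (0.0, 0.98, 0.0), (0.0, 0.0, 2.5)]
kept = prune_outliers(centers, sphere_sdf, max_distance=0.05)
```

The third centre sits 1.5 units outside the sphere and is dropped, which mirrors how a floater far from the SDF surface would be removed.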

2504.01250 2026-06-03 cs.LG cs.SY eess.SY

R2DN: Scalable Parameterization of Contracting and Lipschitz Recurrent Deep Networks

Nicholas H. Barbara, Ruigang Wang, Ian R. Manchester

Affiliations: Australian Centre for Robotics; School of Aerospace, Mechanical and Mechatronic Engineering; The University of Sydney

AI summary: Presents the Robust Recurrent Deep Network (R2DN), built as the feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, with weights parameterized directly so that models are stable (contracting) and robust to small input perturbations (Lipschitz). Unlike recurrent equilibrium networks (RENs), it needs no iterative equilibrium-layer solve, which markedly speeds up inference and backpropagation on GPUs; on nonlinear system identification, observer design, and learning-based feedback control, training and inference are up to an order of magnitude faster at similar performance.

Abstract

This paper presents the Robust Recurrent Deep Network (R2DN), a scalable parameterization of robust recurrent neural networks for machine learning and data-driven control. We construct R2DNs as the feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, and directly parameterize the weights so that our models are stable (contracting) and robust to small input perturbations (Lipschitz) by design. Our parameterization uses a structure similar to the previously-proposed recurrent equilibrium network (REN), but without the requirement to iteratively solve an equilibrium layer at each time-step. This speeds up both model inference and backpropagation on GPUs, and makes it computationally feasible to scale up the network size, batch size, and input sequence length in comparison to RENs. We compare R2DNs to RENs on three representative problems in nonlinear system identification, observer design, and learning-based feedback control. We find that training and inference are both up to an order of magnitude faster with similar test set performance, and that they scale more favorably with respect to model expressivity.

2502.02260 2026-06-03 cs.LG cs.CR

Position: Adversarial ML for LLMs Is Not Making Any Progress

Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr

Affiliations: GitHub; University of California, Berkeley

AI summary: Argues that in the LLM era the problems studied by adversarial machine learning are less clearly defined, harder to solve, and harder to evaluate, so another decade of work may fail to produce meaningful progress.

Comments Accepted at ICML 2026 Position Paper Track

Abstract

In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may be failing to produce meaningful progress.

2412.05123 2026-06-03 cs.SD eess.AS

Differentiable Optimization of Linear Differential Microphone Arrays: A Joint Geometry and Filter Design Framework

Siminfar Samakoush Galougah, Ramani Duraiswami

Affiliations: University of Maryland, College Park

AI summary: Proposes a differentiable optimization framework that jointly optimizes microphone positions and filter weights to achieve the optimal beampattern for linear differential microphone arrays, maintaining the distortionless constraint while balancing directivity, robustness, and hardware efficiency.

Comments 5 pages, 4 figures, 2 tables

Abstract

This paper presents a differentiable optimization framework for the design of constrained Linear Differential Microphone Arrays (LDMAs). The proposed method leverages a non-uniform delay-and-sum beamformer as a light-weight base system model, proving its ability to achieve the optimal beampattern of LDMAs by jointly optimizing microphone positions and filter weights. The formulation enables the optimized design of a filter with a distortion-free constraint in the desired sound direction, while also imposing constraints on microphone positioning to ensure consistent performance. Through evaluation on multiple metrics, including Mean Squared Error (MSE), Directivity Index (DI), White Noise Gain (WNG), and computation time, and comparison with state-of-the-art methods, this approach demonstrates a flexible, directive, robust, and hardware-efficient design.
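
For readers unfamiliar with the base model named in the abstract, the sketch below evaluates the far-field beampattern of a delay-and-sum beamformer on a non-uniform linear array and checks the distortionless property in the look direction. It covers only the textbook narrowband model, not the paper's optimization: the microphone positions, frequency, speed of sound, and uniform weights are assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def beampattern(positions, freq_hz, theta, theta_steer=0.0):
    """Far-field response of a delay-and-sum beamformer on a linear array.

    positions are microphone coordinates in metres along the array axis;
    angles are in radians measured from that axis (0 = endfire).
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    total = sum(
        cmath.exp(1j * k * x * (math.cos(theta) - math.cos(theta_steer)))
        for x in positions
    )
    return total / len(positions)


# Hypothetical non-uniform 4-microphone layout (metres).
mics = [0.0, 0.012, 0.030, 0.055]
on_target = abs(beampattern(mics, 2000.0, theta=0.0))
off_target = abs(beampattern(mics, 2000.0, theta=math.pi))
```

The response has unit magnitude in the steering direction for any layout, so a joint design is free to move the microphones to shape the response elsewhere without violating the distortionless constraint.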

2412.01282 2026-06-03 cs.CV cs.AI

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Align-KD:蒸馏跨模态对齐知识以增强移动端视觉语言模型

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China(通用人工智能国家重点实验室,智能科学与技术学院,北京大学,中国) Huawei Noah’s Ark Lab, China(华为诺亚方舟实验室,中国)

AI总结 提出Align-KD方法,通过蒸馏教师模型浅层跨模态对齐知识,指导1.7B学生模型学习视觉-文本匹配,在6个基准上平均提升2.0分。

Comments CVPR 2025 Paper

详情
AI中文摘要

视觉语言模型(VLM)为多模态任务带来了强大的理解和推理能力。同时,移动设备对强大人工智能的需求也日益增长,例如AI助手软件。一些工作试图将VLM迁移到边缘设备以扩展其应用范围。简化模型结构是一种常见方法,但随着模型缩小,性能与大小之间的权衡变得越来越困难。知识蒸馏(KD)可以帮助模型在不增加大小或数据量的情况下提升综合能力。然而,现有的大模型蒸馏技术大多只考虑单模态LLM的应用,或者仅使用教师为学生创建新的数据环境。这些方法都没有考虑VLM中最重要的跨模态对齐知识的蒸馏。我们提出了一种名为Align-KD的方法,引导学生模型学习发生在浅层的跨模态匹配。教师还帮助学生基于文本的关注点学习将视觉标记投影到文本嵌入空间。在Align-KD的指导下,1.7B的MobileVLM V2模型能够从7B教师模型中学习丰富的知识,且训练损失设计轻量,在两个训练子集上分别在6个基准上平均得分提升2.0。代码地址:https://github.com/fqhank/Align-KD。

英文摘要

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, demand is growing for capable artificial intelligence on mobile devices, such as AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most existing large-model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods takes into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn to project vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training loss design, and achieves an average score improvement of 2.0 across 6 benchmarks under each of two training subsets. Code is available at: https://github.com/fqhank/Align-KD.

2409.08958 2026-06-03 cs.LG cs.AI physics.comp-ph physics.flu-dyn

PINNfluence: Interpreting PINNs through Influence Functions

PINNfluence: 通过影响函数解释 PINN

Aleksander Krasowski, Jonas R. Naujoks, Moritz Weckbecker, Galip Ü. Yolcu, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek, René P. Klausen

发表机构 * Technical University of Munich(慕尼黑技术大学) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) University of Tübingen(图宾根大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出 PINNfluence 框架,基于影响函数对物理信息神经网络进行训练数据归因,实现预测、损失分量和训练数据点之间的细粒度归因,并通过基准实验区分训练好与差的 PINN 的结构特征。

Comments Accepted at ICML 2026

详情
AI中文摘要

物理信息神经网络(PINN)已成为物理科学中求解偏微分方程(PDE)的强大深度学习方法,但其行为在很大程度上仍然不透明,通常通过故障模式分析而非显式可解释性来理解。为了解决这个问题,我们引入了 PINNfluence,这是一个基于影响函数解释 PINN 的训练数据归因框架。通过将影响函数扩展到复合物理信息训练目标,我们实现了预测、损失分量和训练数据点之间的细粒度归因。通过跨各种 PDE 的基准实验,我们证明了影响模式提供了区分训练良好和训练不良的 PINN 结构特征的细粒度诊断。因此,PINNfluence 通过数据视角为理解和提高 PINN 的可靠性开辟了新途径。

英文摘要

Physics-informed neural networks (PINNs) have emerged as a powerful deep learning approach for solving partial differential equations (PDEs) in the physical sciences, yet their behavior remains largely opaque and is typically understood through failure mode analyses rather than explicit interpretability. To address this issue, we introduce PINNfluence, a training data attribution framework for interpreting PINNs based on influence functions. By extending influence functions to composite physics-informed training objectives, we enable fine-grained attribution between predictions, loss components, and training data points. Through benchmark experiments across various PDEs, we demonstrate that influence patterns provide granular diagnostics that distinguish structural characteristics across well-trained and poorly-trained PINNs. PINNfluence thus opens a new avenue for understanding and improving the reliability of PINNs through the lens of their data.
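For orientation, the standard influence-function estimate that this kind of training data attribution builds on can be written as below. This is the usual single-loss form, not the paper's extension to composite objectives; the symbols $\hat\theta$ (trained parameters), $H_{\hat\theta}$ (loss Hessian), $z$ and $z_{\text{test}}$ (a training point and a test point), and the two-term PINN loss with weights $\lambda$ are assumptions introduced only for illustration.

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
     H_{\hat\theta}^{-1}\,
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta),
\qquad
L_{\text{PINN}} = \lambda_{\text{pde}}\, L_{\text{pde}} + \lambda_{\text{bc}}\, L_{\text{bc}}
```

Read this way, attributing a prediction to a collocation or boundary point amounts to choosing which loss component supplies the training-side gradient.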

2411.15851 2026-06-03 cs.CV

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

ResCLIP: 用于无训练密集视觉-语言推理的残差注意力

Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出残差交叉相关自注意力模块和语义反馈精炼模块,利用中间层交叉相关注意力重组空间信息,提升CLIP在密集预测任务中的性能。

Journal ref Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 29968-29978

详情
AI中文摘要

尽管像CLIP这样的视觉-语言模型在开放词汇任务中取得了显著成功,但其应用目前局限于图像级任务,在密集预测方面仍存在困难。最近的研究通常将这种密集预测的不足归因于最终块中的自注意力层,并通过将原始的查询-键注意力修改为自相关注意力(例如查询-查询和键-键注意力)取得了可观的成果。然而,这些方法忽略了捕捉丰富空间对应关系的交叉相关注意力(查询-键)特性。在本文中,我们揭示了CLIP非最终层中自注意力的交叉相关性也表现出定位特性。因此,我们提出了残差交叉相关自注意力(RCS)模块,该模块利用中间层的交叉相关自注意力来重塑最终块中的注意力。RCS模块有效重组了空间信息,释放了CLIP在密集视觉-语言推理中的定位潜力。此外,为了增强对相同类别区域的关注和局部一致性,我们提出了语义反馈精炼(SFR)模块,该模块利用语义分割图进一步调整注意力分数。通过整合这两种策略,我们的方法(称为ResCLIP)可以轻松作为即插即用模块集成到现有方法中,显著提升其在密集视觉-语言推理中的性能。在多个标准基准上的大量实验表明,我们的方法超越了最先进的无训练方法,验证了所提方法的有效性。代码可在 https://github.com/yvhangyang/ResCLIP 获取。

英文摘要

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.

2410.14573 2026-06-03 cs.LG cs.AI

Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability

在黑盒优化中建立信任:可解释性的综合框架

Nazanin Nezami, Hadis Anahideh

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出一套模型无关的指标IEMSO,通过采样核心、批次属性、优化过程和特征重要性四类指标,增强代理优化方法的透明性和可解释性。

详情
AI中文摘要

在受限评估预算内优化昂贵的黑盒函数在许多实际应用中面临重大挑战。代理优化(SO)是一种常见的解决方案,但其由代理模型和采样核心(例如采集函数)的复杂性引入的专有性质往往导致缺乏可解释性和透明度。尽管现有文献主要集中在增强对全局最优的收敛性,但新提出策略的实际解释仍未被充分探索,特别是在批量评估设置中。在本文中,我们提出了代理优化的包容性可解释性指标(IEMSO),这是一组全面的模型无关指标,旨在增强SO方法的透明度、可信度和可解释性。通过这些指标,我们在执行昂贵评估之前和之后为从业者提供中间和事后解释,以建立信任。我们考虑了四类主要指标,每类针对SO过程的特定方面:采样核心指标、批次属性指标、优化过程指标和特征重要性。我们的实验评估证明了所提指标在不同基准上的显著潜力。

英文摘要

Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world applications. Surrogate Optimization (SO) is a common resolution, yet its proprietary nature introduced by the complexity of surrogate models and the sampling core (e.g., acquisition functions) often leads to a lack of explainability and transparency. While existing literature has primarily concentrated on enhancing convergence to global optima, the practical interpretation of newly proposed strategies remains underexplored, especially in batch evaluation settings. In this paper, we propose Inclusive Explainability Metrics for Surrogate Optimization (IEMSO), a comprehensive set of model-agnostic metrics designed to enhance the transparency, trustworthiness, and explainability of the SO approaches. Through these metrics, we provide both intermediate and post-hoc explanations to practitioners before and after performing expensive evaluations to gain trust. We consider four primary categories of metrics, each targeting a specific aspect of the SO process: Sampling Core Metrics, Batch Properties Metrics, Optimization Process Metrics, and Feature Importance. Our experimental evaluations demonstrate the significant potential of the proposed metrics across different benchmarks.

2310.10322 2026-06-03 cs.CL

Evaluating the Reversal Curse in Model Editing

评估模型编辑中的逆转诅咒

Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Quan Liu, Cong Liu, Jia-Chen Gu

发表机构 * National Engineering Research Center of Speech and Language Information Processing(语音及语言信息处理国家工程研究中心) University of Science and Technology of China(中国科学技术大学) iFLYTEK Research(科大讯飞研究院) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文研究双向语言模型编辑,提出反向泛化指标并构建BAKE基准,发现多数编辑方法在反向评估中存在系统性缺陷,并分析逆转诅咒的成因及缓解策略。

Comments Accepted by TMLR

详情
AI中文摘要

大型语言模型(LLMs)由于错误或过时的知识容易产生不期望的文本幻觉。由于重新训练LLMs资源密集,模型编辑日益受到关注。尽管出现了基准和方法,现有的单向编辑和评估范式未能探索逆转诅咒。在本文中,我们研究双向语言模型编辑,旨在提供严格的评估,以判断编辑后的LLMs能否双向回忆编辑知识。引入了反向泛化指标,并构建了名为BAKE(双向知识编辑评估)的基准,用于评估编辑后的模型能否在编辑的反向方向上回忆编辑知识。我们使用多种编辑方法和LLMs进行了大量实验。结果表明,尽管大多数编辑方法能够沿着修改方向准确回忆编辑事实,但在反向方向评估时,它们表现出显著的系统性缺陷。为了进一步研究逆转诅咒的根本原因并探索潜在的缓解策略,我们从三个角度进行了详细分析。我们的发现表明,尽管上下文学习(ICL)可以在一定程度上缓解逆转诅咒,但它缺乏连续性,受输入长度限制,并可能引入幻觉。因此,结合ICL和其他编辑方法的优势是开发新编辑范式的一个有前景的方向。

英文摘要

Large language models (LLMs) are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been a growing interest in model editing. Despite the emergence of benchmarks and approaches, existing unidirectional editing and evaluation paradigms have failed to explore the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation of whether edited LLMs can recall the edited knowledge bidirectionally. A metric of reverse generalization is introduced, and a benchmark dubbed Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate whether post-edited models can recall the edited knowledge in the reverse direction of editing. We conduct extensive experiments using a variety of editing methods and LLMs. The results show that while most editing methods are able to accurately recall edited facts along the modification direction, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. To further investigate the underlying causes of the reversal curse and to explore potential strategies for mitigation, a detailed analysis is conducted from three perspectives. Our findings reveal that although In-Context Learning (ICL) can mitigate the reversal curse to a certain extent, it lacks continuity, is limited by the input length, and may introduce hallucinations. Therefore, combining the advantages of ICL and other editing methods is a promising direction for developing new editing paradigms.

2407.18428 2026-06-03 cs.LG cs.AI cs.CV

Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

加权风险不变性:不变特征偏移下的领域泛化

Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu

发表机构 * Johns Hopkins University(约翰霍普金斯大学) University of Maryland, College Park(马里兰大学学院公园分校) New York University(纽约大学) Center for Data Science(数据科学中心)

AI总结 针对不变协变量偏移下现有不变学习方法性能不佳的问题,提出加权风险不变性(WRI)框架,通过环境间损失的不变性并加权训练样本,在理论上保证学习到不变模型,并在实验中优于先前方法。

Journal ref TMLR 2024

详情
AI中文摘要

学习预测在多个环境下不变的模型是一种有前景的分布外泛化方法。这类模型被训练来提取特征 $X_{\text{inv}}$,其中给定提取特征的条件分布 $Y \mid X_{\text{inv}}$ 在不同环境下不发生变化。不变模型还应能泛化到提取特征 $X_{\text{inv}}$ 的边缘分布 $p(X_{\text{inv}})$ 的偏移,这种偏移称为 $\textit{不变协变量偏移}$。然而,我们表明,现有学习不变模型的方法在不变协变量偏移下表现不佳,要么无法学习到不变模型(即使对于从简单且经过充分研究的线性-高斯模型生成的数据也是如此),要么有限样本性能较差。为了解决这些问题,我们提出 $\textit{加权风险不变性}$(WRI)。我们的框架基于对训练样本进行适当加权,强制要求损失在不同环境下保持不变。我们证明,在线性-高斯设置下,WRI 可证明地学习到不变模型,即丢弃虚假相关性。我们提出了一种实用算法,通过同时学习密度 $p(X_{\text{inv}})$ 和模型参数来实现 WRI,并且实验表明,在不变协变量偏移下,WRI 优于先前的不变学习方法。

英文摘要

Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an $\textit{invariant covariate shift}$. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models (even for data generated from simple and well-studied linear-Gaussian models) or having poor finite-sample performance. To alleviate these problems, we propose $\textit{weighted risk invariance}$ (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.

2407.11821 2026-06-03 cs.AI

Approximating Probabilistic Inference in Statistical EL with Knowledge Graph Embeddings

使用知识图谱嵌入近似统计EL中的概率推理

Yuqicheng Zhu, Nico Potyka, Bo Xiong, Trung-Kien Tran, Mojtaba Nayyeri, Evgeny Kharlamov, Steffen Staab

发表机构 * Bosch Center for AI(博世人工智能中心) University of Stuttgart(斯图加特大学) Cardiff University(卡迪夫大学) Stanford University(斯坦福大学) University of Oslo(奥斯陆大学) University of Southampton(南安普顿大学)

AI总结 本文提出利用知识图谱嵌入高效近似统计EL中的概率推理,并提供了运行时和正确性保证的理论证明及实验评估。

Comments Accepted at UAI 2026

详情
AI中文摘要

统计信息无处不在,但从中得出有效结论却极其困难。我们以统计EL(SEL)为例,解释了如何使用知识图谱嵌入来高效近似概率推理,SEL是轻量级描述逻辑EL的统计扩展。我们提供了运行时和正确性保证的证明,并通过实验评估了我们方法的运行时和近似质量。

英文摘要

Statistical information is ubiquitous but drawing valid conclusions from it is prohibitively hard. We explain how knowledge graph embeddings can be used to approximate probabilistic inference efficiently using the example of Statistical EL (SEL), a statistical extension of the lightweight Description Logic EL. We provide proofs for runtime and soundness guarantees, and empirically evaluate the runtime and approximation quality of our approach.

2407.05312 2026-06-03 cs.CV

An Improved Method for Personalizing Diffusion Models

一种改进的扩散模型个性化方法

Yan Zeng, Masanori Suganuma, Takayuki Okatani

发表机构 * Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究生院) RIKEN Center for AIP(理化学研究所AIP研究中心)

AI总结 提出一种在整合新信息时保留模型原有知识的扩散模型个性化方法,相比Dreambooth和文本反转训练时间更短且效果更优。

详情
AI中文摘要

扩散模型已经展示了令人印象深刻的图像生成能力。个性化方法,如文本反转和Dreambooth,通过使用特定图像增强模型的个性化。这些方法能够基于多样的文本上下文生成特定对象的图像。我们提出的方法旨在在整合新信息时保留模型的原有知识,从而在比Dreambooth和文本反转更少的训练时间内获得更优的结果。

英文摘要

Diffusion models have demonstrated impressive image generation capabilities. Personalized approaches, such as textual inversion and Dreambooth, enhance model individualization using specific images. These methods enable generating images of specific objects based on diverse textual contexts. Our proposed approach aims to retain the model's original knowledge during new information integration, resulting in superior outcomes while necessitating less training time compared to Dreambooth and textual inversion.

2405.03386 2026-06-03 cs.LG

Annot-Mix: Learning with Noisy Class Labels from Multiple Annotators via a Mixup Extension

Annot-Mix: 通过混合扩展从多个标注者学习带噪声类别标签

Marek Herde, Lukas Lührs, Denis Huseljic, Bernhard Sick

发表机构 * University of Kassel(卡塞尔大学)

AI总结 提出Annot-Mix框架,通过扩展mixup处理多标注者提供的类别标签,在11个数据集上优于11种现有方法。

Comments 9 pages, 8 figures, 4 tables; post-publication arXiv version with minor editorial corrections; methodology, results, and conclusions unchanged

Journal ref ECAI 2024: 27th European Conference on Artificial Intelligence, IOS Press, pp. 2910-2918, 2024

详情
AI中文摘要

使用带噪声的类别标签进行训练会损害神经网络的泛化性能。在此背景下,mixup是一种流行的正则化技术,通过使记忆错误类别标签更加困难来提高训练鲁棒性。然而,mixup忽略了多个标注者(例如众包工作者)通常提供类别标签的事实。因此,我们提出了mixup的一种扩展,该扩展处理每个实例的多个类别标签,同时考虑哪个类别标签来自哪个标注者。集成到我们的多标注者分类框架annot-mix中,在包含来自人类或模拟标注者的噪声类别标签的11个数据集的评估研究中,它的性能优于11种(大多数是最先进的)方法。我们的代码通过我们的GitHub仓库公开提供:https://github.com/ies-research/multi-annotator-machine-learning/tree/annot-mix

英文摘要

Training with noisy class labels impairs neural networks' generalization performance. In this context, mixup is a popular regularization technique to improve training robustness by making memorizing false class labels more difficult. However, mixup neglects that multiple annotators, e.g., crowdworkers, typically provide class labels. Therefore, we propose an extension of mixup, which handles multiple class labels per instance while considering which class label originates from which annotator. Integrated into our multi-annotator classification framework annot-mix, it performs superiorly to eleven (mostly state-of-the-art) approaches in an evaluation study with eleven datasets comprising noisy class labels from either human or simulated annotators. Our code is publicly available through our GitHub repository at https://github.com/ies-research/multi-annotator-machine-learning/tree/annot-mix
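For readers unfamiliar with the regularizer being extended, here is a minimal sketch of standard mixup on a single pair of examples. It shows only the classic single-label version; the multi-annotator handling that Annot-Mix adds is not reproduced, and the function name, the `alpha` default, and the toy vectors are assumptions.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Convexly combine two (features, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # seeded only to make the sketch repeatable
x_mix, y_mix, lam = mixup_pair(
    [1.0, 0.0], [1.0, 0.0, 0.0],  # toy example of class 0
    [0.0, 2.0], [0.0, 0.0, 1.0],  # toy example of class 2
    alpha=1.0,
    rng=rng,
)
```

Because the mixed target is a soft label, a single wrong class label contributes only a fraction of the training signal, which is the robustness effect the abstract refers to.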

2403.19883 2026-06-03 cs.AI

Planning with Uncertainty: Symmetries, Policy Inference, and Solution Compression

不确定性下的规划:对称性、策略推断与解的压缩

Frederico Messa, André Grahl Pereira

发表机构 * INF/UFRGS(南里奥格兰德联邦大学信息学院)

AI总结 本文提出基于显式最佳优先策略空间搜索的FOND规划方法,通过定义策略等价关系、利用群论计算状态对称性、多项式时间策略推断以及整数规划实现部分状态策略压缩,显著提升求解效率。

详情
AI中文摘要

完全可观测非确定性(FOND)规划是人工智能不确定性规划的核心。它通过具有非确定性效果的动作来建模不确定性。在这项工作中,我们提出了一系列技术,将显式最佳优先策略空间搜索建立为一种与当前最先进方法相竞争的方法,用于解决FOND规划任务。我们研究了如何定义策略之间的等价关系,从而允许剪枝部分搜索空间。我们展示了可以使用群论技术有效计算状态之间的规范对称性。我们还提出了两项超越策略空间搜索的贡献:一个过程,在给定策略域集规范的情况下,能在多项式时间内推断出解策略函数;以及一个整数规划公式化过程,给定一个定义在完整状态上的解策略,能产生一组资源高效的模型,这些模型能够找到以最少部分状态无歧义地表示该策略的部分状态策略。

英文摘要

Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertainty through actions with non-deterministic effects. In this work, we present a collection of techniques that establish explicit best-first policy-space search as a method competitive with the state of the art for solving FOND planning tasks. We study how to define equivalence relations between policies, allowing part of the search space to be pruned. We show it is possible to use group theory techniques to effectively compute canonical symmetries between states. We also present two contributions that go beyond just policy-space search: we present a procedure that infers in polynomial time a solution policy function given just the specification of its domain set, and an integer-programming formulation procedure that, given a solution policy defined over complete states, yields a set of resource-efficient models that are capable of finding a partial-state policy that represents it unambiguously with the fewest partial states possible.

2303.15619 2026-06-03 cs.CL cs.AI

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

Typhoon: 面向预训练语言模型的有效任务特定掩码策略

Muhammed Shahir Abdurrahman, Hashem Elezabi, Bruce Changlong Xu

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系)

AI总结 本文提出Typhoon,一种基于任务损失梯度的自适应掩码策略,在GLUE任务上对比随机掩码和整词掩码,经严格评估发现无显著优势。

详情
AI中文摘要

在掩码语言建模(MLM)中,选择哪些token进行掩码是一个核心但未被充分研究的设计决策。标准预训练随机均匀掩码token,但多项研究表明,更具信息性的掩码目标可以提升下游性能。我们将掩码视为微调流程中任务自适应的组件,并引入Typhoon,一种掩码策略,它利用任务损失相对于one-hot token输入的梯度来在线估计每种token类型对目标的贡献程度。Typhoon维护每个token类型显著性的指数移动平均,并将这些分数校准为掩码分布,在token独立性近似下,其期望掩码率与目标预算匹配。我们形式化了该方法,并在两个GLUE任务(MRPC和CoLA)上,针对三个BERT系列骨干网络(TinyBERT、DistilBERT和BERT-base)以及每个配置五个随机种子(总共90次训练运行),将其与随机掩码和整词掩码进行了评估。我们的主要发现是,一旦考虑了种子方差,没有哪种掩码策略在这些任务上可靠地优于其他策略:在MRPC上,Typhoon与最佳基线之间的差距保持在0.004 F1以内,所有十二次Typhoon比较中无配对检验达到显著性,且每个95%置信区间包含零。Typhoon在单次运行实验中的明显优势并未经受住这种更仔细的评估。我们将此视为一个警示性的、以可重复性为重点的结果——基于梯度的任务自适应掩码具有竞争力,但在此规模上并不明显优于无资源的随机掩码——我们描述了一个干净的现代重实现以支持后续工作。

英文摘要

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 $F_1$, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result: gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale. We also describe a clean modern reimplementation to support follow-up work.
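The two bookkeeping steps described in the abstract, an exponential moving average of per-token-type saliency and a calibration to a target masking budget, can be sketched roughly as follows. The saliency values are supplied as plain numbers rather than computed from gradients, and the calibration rule (a single scale factor found by bisection, with clipping at 1) is an assumption for illustration, not the paper's formulation.

```python
def ema_update(saliency, observed, decay=0.9):
    """Update the per-token-type EMA in place: s <- decay * s + (1 - decay) * g."""
    for tok, g in observed.items():
        saliency[tok] = decay * saliency.get(tok, 0.0) + (1.0 - decay) * g
    return saliency


def calibrate(tokens, saliency, budget=0.15, floor=1e-6, iters=60):
    """Find a scale c so that mean(min(1, c * s_i)) over the sequence matches the budget."""
    scores = [max(saliency.get(t, 0.0), floor) for t in tokens]
    lo, hi = 0.0, 1.0 / min(scores)  # at hi every probability is clipped to 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        rate = sum(min(1.0, mid * s) for s in scores) / len(scores)
        if rate < budget:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * s) for s in scores]


saliency = {}
# Hypothetical gradient-magnitude observations for three token types.
ema_update(saliency, {"not": 0.8, "the": 0.1, "movie": 0.4})
tokens = ["the", "movie", "not", "the"]
probs = calibrate(tokens, saliency, budget=0.15)
expected_rate = sum(probs) / len(probs)
```

Sampling each token independently with its probability then gives an expected masking rate equal to the budget, which mirrors the token-independence approximation mentioned in the abstract.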

2606.03940 2026-06-03 eess.IV cs.CV cs.LG cs.RO

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

SEAOTTER: 基于传感器嵌入自编码器与一次性转码的高效重建

Dan Jacobellis, Neeraja J. Yadwadkar

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出SEAOTTER框架,结合传感器嵌入自编码器与可学习JPEG转码,在200:1压缩比下实现比AVIF快7倍编码、3.5倍解码,并提升ImageNet top-1准确率8%,同时保持JPEG兼容性。

详情
AI中文摘要

在机器人系统中,使用低成本、低功耗硬件可以轻松捕获高分辨率的大量视觉数据。然而,当通过JPEG/MPEG等传统编解码器传输时,有限的带宽和机载计算资源阻碍了充分利用。较新的编解码器(如AV1/AVIF)改善了率失真权衡,但需要更多资源进行编码,在没有定制ASIC的情况下不切实际。最近的非对称自编码器在极端功率和带宽约束下提供高质量,但增加了高昂的解码成本,并使用忽略围绕JPEG等标准建立的数十年基础设施的特有格式。为了解决这些限制,我们引入了一种基于传感器嵌入自编码器与一次性转码的高效重建(SEAOTTER)的云机器人压缩框架。由于传感器、云和消费阶段面临非常不同的功率和带宽预算,SEAOTTER结合了学习潜变量的紧凑性和标准JPEG文件的广泛可用性。由于朴素转码会降低性能,我们提出了一种可学习的JPEG颜色和量化变换,能够提高全局、密集和基于视觉语言感知的准确性。使用SEAOTTER,我们为预训练的冻结编码器训练通用和任务感知的转码流水线。在200:1的压缩比下,与AVIF相比,我们观察到编码速度提高7倍,解码速度提高3.5倍,ImageNet top-1准确率提高8%,同时保持与JPEG基础设施的兼容性。我们的代码可在 https://github.com/UT-SysML/seaotter 获取。

英文摘要

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter.

2606.03455 2026-06-03 eess.AS cs.SD

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

WavTTS:通过直接原始波形建模实现高质量零样本TTS

Wenxi Chen, Dongya Jia, Yushen Chen, Zhikang Niu, Yuzhe Liang, Xiquan Li, Ruiqi Yan, Ziyang Ma, Guanrou Yang, Sanyuan Chen, Yue Wang, Zhuo Chen, Kai Yu, Xie Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) ByteDance Seed(字节跳动Seed)

AI总结 提出WavTTS,首个基于流匹配与扩散Transformer的原始波形生成TTS模型,通过简单分块策略直接建模波形并集成多尺度梅尔频谱监督,在零样本TTS中接近潜在空间生成模型性能。

详情
AI中文摘要

最近,基于VAE潜在变量或梅尔频谱的扩散模型已成为零样本TTS的主流范式。尽管这些压缩表示提高了生成效率,但它们不可避免地遭受信息损失和非端到端训练的问题。理论上,直接建模原始波形可以规避这些问题;然而,由于音频信号序列长度极长,这一方向尚未充分探索且常被认为困难。为了克服这一点,我们提出了WavTTS,这是第一个原始波形生成TTS模型,显著缩小了与潜在空间生成模型的差距。基于流匹配与扩散Transformer(DiT),WavTTS通过简单的分块策略直接建模语音波形,同时集成多尺度梅尔频谱监督以在训练过程中提供感知指导。此外,我们研究了波形扩散中预测目标和噪声调度的影响,并开发了一种有效的调度设计以提高生成质量。在开源基准上的评估表明,WavTTS接近当前最先进的潜在生成零样本TTS模型的性能,同时显著优于之前的端到端语音生成模型。我们的发现证明了直接在波形空间扩展基于扩散的TTS的可行性,为端到端语音生成开辟了新方向。

英文摘要

Recently, diffusion models operating on VAE latents or mel-spectrograms have become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably suffer from information loss and non-end-to-end training. Theoretically, directly modeling raw waveforms circumvents these issues; however, this direction remains underexplored and is often deemed difficult due to the extremely long sequence length of audio signals. To overcome this, we propose WavTTS, the first raw waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built upon the flow matching with Diffusion Transformer (DiT), WavTTS directly models speech waveforms via a simple patchification strategy, while integrating multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design to improve generation quality. Evaluations on open-source benchmarks demonstrate that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
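To make the two ingredients named in the abstract concrete, the sketch below patchifies a raw waveform into fixed-size tokens and builds a flow-matching training pair. The linear interpolation path, the patch size, and the 24 kHz sample rate are assumptions (a common choice in flow-matching work), not details confirmed by the abstract, and the mel-spectrogram supervision is omitted.

```python
def patchify(waveform, patch_size):
    """Split a 1-D waveform into non-overlapping patches, zero-padding the tail."""
    pad = (-len(waveform)) % patch_size
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def flow_matching_pair(noise, data, t):
    """Linear interpolant x_t = (1 - t) * noise + t * data and its target velocity data - noise."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


patches = patchify([0.1] * 10, 4)
x_t, v = flow_matching_pair([0.0, 1.0], [1.0, 3.0], 0.25)

# Sequence-length arithmetic with hypothetical numbers:
# 10 s of audio at 24 kHz is 240000 samples; patches of 240 samples give 1000 tokens.
tokens_10s = len(patchify([0.0] * 240000, 240))
```

The last line shows why patchification matters for raw waveforms: it is what brings the sequence length into a range a Transformer can handle.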

2606.03116 2026-06-03 eess.AS cs.AI cs.SD

AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following

AnyAudio-Judge:基于动态评分标准的音频指令跟随基准与评估器

Haitao Li, Tian Tan, Yuguang Yang, Shan Yang, Xie Chen

发表机构 * Zhejiang University(浙江大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) Tencent Hunyuan(腾讯文脉)

AI总结 针对指令引导音频生成中复杂指令解耦困难、评估缺乏可解释性和细粒度属性匹配的问题,提出基于动态评分标准的评估范式,通过自适应分解音频描述为可验证的二元评分项,并构建包含7920个样本的双语基准和105K训练语料,结合SFT与GRPO训练专用评估器,在零样本对齐检测和下游强化学习指令对齐中取得显著提升。

详情
AI中文摘要

指令引导音频生成的快速发展凸显了对稳健对齐评估的迫切需求。当前的自动评估方法严重依赖通用大语言模型的整体评分,难以解耦复杂指令,缺乏可解释性,且无法捕捉细粒度的属性不匹配。为解决这一问题,我们引入了一种新颖的基于动态评分标准的评估范式,该范式自适应地将复杂的音频描述分解为可变数量的独立、可验证的二元评分项。为了严格基准测试这一能力,我们提出了AnyAudio-Judge Bench,一个全面的双语基准,包含7920个精心策划的样本,涵盖四个不同的音频领域(语音、声音、音乐和混合),并包含特意构建的困难负样本。此外,我们构建了一个包含105K样本的大规模语料库,并带有明确的思维链(CoT)理由,以训练我们的专用评估器——AnyAudio-Judge模型。通过采用结合监督微调(SFT)和组相对策略优化(GRPO)的训练流程,我们的模型成功将其推理路径与基于评分标准的评分机制对齐。大量实验表明,AnyAudio-Judge不仅显著增强了与最先进基线相比的零样本对齐检测,而且提供了精确且可解释的奖励信号,显著改善了音频生成下游强化学习中的指令对齐。

英文摘要

The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.
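One simple way to turn a set of binary rubric judgements into a scalar score is to take the fraction of satisfied items, as sketched below. The aggregation rule, the example caption, and the rubric items are all assumptions for illustration; the abstract does not say how AnyAudio-Judge combines its rubric items.

```python
def rubric_score(judgements):
    """Fraction of binary rubric items judged as satisfied (True)."""
    if not judgements:
        raise ValueError("at least one rubric item is required")
    return sum(1 for ok in judgements.values() if ok) / len(judgements)


# Hypothetical rubric for the caption "a calm female voice over soft piano, no drums".
judgements = {
    "contains a female voice": True,
    "voice sounds calm": True,
    "piano is present": True,
    "no drums are audible": False,
}
score = rubric_score(judgements)
```

Keeping the per-item verdicts alongside the score is what makes the result interpretable: the failing item names the attribute mismatch directly.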

2606.02913 2026-06-03 eess.AS cs.SD

A Comparison of Generative and Discriminative Methods for Speech Enhancement: Robustness, Complexity, and Hallucination

生成式与判别式语音增强方法的比较:鲁棒性、复杂性与幻觉

Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

发表机构 * Fraunhofer IIS(弗劳恩霍夫集成电路研究所) Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔兰根-纽伦堡弗里德里希-亚历山大大学)

AI总结 本文比较了生成式和判别式深度学习方法在语音增强中的表现,分析了高/低信噪比、匹配/失配训练场景下的鲁棒性、复杂度与幻觉特性。

详情
AI中文摘要

在本研究中,我们对基于深度学习的生成式和判别式语音增强方法进行了全面的比较分析,特别是在降噪任务中。我们的研究重点在于评估它们在高低信噪比条件下的有效性,同时考虑匹配和不匹配的训练场景。我们进一步研究了训练数据量、模型收敛速度的影响,并根据所考虑的训练范式,从客观结果的角度解释了性能差异。此外,我们比较了这些方法的复杂度-性能权衡和实际可行性。为了进一步加强评估,我们研究了生成式方法在词错误率和音素相似度方面的幻觉特性。本研究得出的见解提供了经验证据,帮助研究人员和从业者理解不同方法的感知增益是否证明了其在实际应用中的计算成本是合理的。

英文摘要

In this study, we conduct a comprehensive comparative analysis of generative and discriminative deep learning-based speech enhancement methods, specifically in noise reduction tasks. Our investigation focuses on evaluating their effectiveness under high and low signal-to-noise ratio conditions, considering both matched and mismatched training scenarios. We further investigate the impact of training data volume and model convergence speed, and interpret the performance differences in terms of objective results for the considered training paradigms. Additionally, we compare the complexity-performance trade-off and the practical viability of these approaches. To further strengthen the evaluation, we study the hallucination characteristics of generative approaches in terms of word error rate and phoneme similarity. The insights derived from this study provide empirical evidence to assist researchers and practitioners in understanding whether the perceptual gains of different approaches justify their computational cost in practical applications.

2606.02906 2026-06-03 eess.IV cs.CV

Depth from Dual Differential Defocus and Stereo Consensus

基于双差分散焦与立体一致性的深度估计

Junjie Luo, Wei Xu, Dylan Chu, Emma Alexander, Qi Guo

发表机构 * Purdue University(普渡大学) Northwestern University(西北大学)

AI总结 提出D^3S Consensus算法,融合散焦深度与立体视觉,在超出景深范围内实现高精度深度估计,通过物理独立线索的一致性选择可靠预测,以更小基线达到可比工作范围。

详情
AI中文摘要

我们提出了D^3S Consensus,一种基于物理的闭式算法,它统一了散焦深度(DfD)和立体视觉,在超出相机景深(DoF)的扩展工作范围内实现高精度深度估计。给定一对双散焦立体图像,该方法通过一种新颖的DfD理论——双差分散焦(D^3)和(S)立体耦合方式,估计一组超定深度。然后,通过在这些物理独立线索之间强制执行一致性,从该组中选择最可信的深度预测,以拒绝不可靠的估计。分析表明,在相同误差容限下,D^3S与先前的基于三角测量的深度估计系统相比,以10倍小的基线实现了可比的工作范围。这使得紧凑的无源双目测距仪具有比传统立体和DfD设计小得多的外形尺寸。我们展示了第一个D^3S原型,其基线仅为4毫米,EFL为12毫米。它通过单次采集生成高达900×1800像素的深度图,在0.3-1.64米范围内平均绝对误差为1厘米。这已经超过了某些具有更大外形尺寸的商用立体相机的报告精度。

英文摘要

We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depth using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only 4 mm baseline and 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This has surpassed the reported accuracy of certain commercially available stereo cameras with much larger form factors.
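As background for the baseline claim, the standard pinhole stereo relation and its first-order error propagation can be worked through with the prototype's numbers. This covers only conventional triangulation, not the D^3 theory or the consensus step; the function names and the chosen working distance of 1 m are assumptions.

```python
def stereo_depth(focal_m, baseline_m, disparity_m):
    """Pinhole triangulation: Z = f * B / d, all quantities in metres."""
    return focal_m * baseline_m / disparity_m


def depth_error(focal_m, baseline_m, depth_m, disparity_error_m):
    """First-order error propagation: |dZ| ~ Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_m * baseline_m) * disparity_error_m


FOCAL = 0.012     # 12 mm EFL, from the abstract
BASELINE = 0.004  # 4 mm baseline, from the abstract

# On-sensor disparity that corresponds to a target at 1 m.
disparity_at_1m = FOCAL * BASELINE / 1.0
depth_check = stereo_depth(FOCAL, BASELINE, disparity_at_1m)

# Disparity precision that stereo alone would need for a 1 cm error at 1 m.
required_precision = 0.01 * FOCAL * BASELINE / 1.0 ** 2
error_check = depth_error(FOCAL, BASELINE, 1.0, required_precision)
```

Under these assumptions the required disparity precision is about 0.48 micrometres on the sensor, and it tightens in proportion as the baseline shrinks. That gives a sense of why a second, physically independent cue such as defocus is useful at such a small baseline.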