arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2512.11995 2026-06-10 cs.CV cs.AI cs.LG Updated version

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou

Affiliations: University of Maryland, College Park; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes the V-REX benchmark, which uses a Chain-of-Questions to split multi-step exploratory reasoning into planning and following abilities, and evaluates vision-language models on complex open-ended tasks.

Comments 28 pages

Abstract

While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification, as an AI detective would, but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.

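2512.08180 2026-06-10 cs.CV Updated version

GeoLoom: High-quality Geometric Diagram Generation from Textual Input

Xiaojing Wei, Ting Zhang, Wei He, Jingdong Wang, Hua Huang

Affiliations: GitHub

AI summary: Proposes the GeoLoom framework, which converts natural-language geometric descriptions into high-quality diagrams through an autoformalization module and a coordinate solver, introduces a constraint-based evaluation metric, and significantly outperforms existing methods.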

Abstract

High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into GeoLingua, a purpose-built generation-oriented formal language, and a coordinate solver that maps formal constraints to precise coordinates using efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.
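
To make the idea of mapping formal constraints to coordinates concrete, here is a minimal sketch. It is not GeoLoom's solver or the GeoLingua language, whose details the abstract does not give. The three constraints, the squared-residual loss, the random-perturbation search, and the names `loss` and `solve` are all assumptions chosen for illustration. The same loss also serves as a simple stand-in for a constraint-based deviation score.

```python
import math
import random
from typing import Dict, Tuple

Point = Tuple[float, float]


def loss(p: Dict[str, Point]) -> float:
    """Sum of squared residuals for three hypothetical constraints:
    |AB| = 3, |AC| = 4, and AB perpendicular to AC."""
    ax, ay = p["A"]
    bx, by = p["B"]
    cx, cy = p["C"]
    ab = (bx - ax, by - ay)
    ac = (cx - ax, cy - ay)
    r_len_ab = math.hypot(*ab) - 3.0
    r_len_ac = math.hypot(*ac) - 4.0
    r_perp = ab[0] * ac[0] + ab[1] * ac[1]  # dot product is 0 when perpendicular
    return r_len_ab ** 2 + r_len_ac ** 2 + r_perp ** 2


def solve(steps: int = 20000, step_size: float = 0.5, seed: int = 0):
    """Random-perturbation search: move one point at a time and keep the
    move only if the total constraint violation decreases."""
    rng = random.Random(seed)
    pts = {name: (rng.uniform(-5, 5), rng.uniform(-5, 5)) for name in "ABC"}
    start = best = loss(pts)
    for _ in range(steps):
        name = rng.choice("ABC")
        x, y = pts[name]
        trial = dict(pts)
        trial[name] = (x + rng.gauss(0, step_size), y + rng.gauss(0, step_size))
        trial_loss = loss(trial)
        if trial_loss < best:  # accept improving moves only
            pts, best = trial, trial_loss
    return pts, start, best


points, initial_loss, final_loss = solve()
```

Because only improving moves are accepted, `final_loss` can never exceed `initial_loss`. A production solver would likely add restarts or a shrinking step size, which this sketch omits.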

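2512.02240 2026-06-10 cs.CL Updated version

Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung, Esmeralda S. Whitammer, Mirella Lapata

Affiliations: School of Informatics, University of Edinburgh; CIFAR Fellow

AI summary: Proposes LiteReason, which uses a lightweight Reasoning Projector to produce continuous latent tokens and switches dynamically between latent and discrete reasoning during reinforcement learning, cutting reasoning length by 77-92% while staying close to non-latent RL performance.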

Abstract

Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

2404.09101 2026-06-10 cs.LG cs.AI cs.NA math.NA stat.ML Updated version

Mixtures of Neural Operators Reduce Active Complexity in Operator Learning

Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop

Affiliations: McMaster University and Vector Institute; Shimane University; Rice University; University of Helsinki

AI summary: Compares routed mixtures of neural operators (MoNOs) with a fixed single-neural-operator construction and proves that the active expert of a MoNO has better depth, width, and rank scaling, with these quantities bounded by O(ε^{-1}) for Lipschitz targets.

Abstract

Operator-learning systems are not governed solely by total parameter count; for one query, the relevant bottleneck can be the model that must be loaded and evaluated. We study this distinction for classical neural operators on compact Sobolev subsets through a constructive comparison between routed mixtures of neural operators (MoNOs) and a fixed single-neural-operator construction. The comparison concerns expert-active complexity relative to that baseline, with total stored size and routing search accounted separately. A MoNO routes each input function through a tree to one expert. Our main theorem shows that every scalar uniformly continuous nonlinear operator with bounded output Sobolev radius on the approximation set admits a MoNO approximation whose active expert has smaller depth, width, and rank scaling than the analyzed single-neural-operator construction; for Lipschitz targets these expert quantities are bounded by $\mathcal{O}(\varepsilon^{-1})$. The theorem turns localization into an operator-level accounting of active expert size, routing depth, and number of experts. We also prove a quantitative universal approximation theorem for the underlying neural-operator architecture, with explicit dependence on compact-set diameter and modulus of continuity.

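2511.10234 2026-06-10 cs.LG cs.AI Updated version

Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca

Affiliations: University of Cambridge; ETH Zurich

AI summary: Studies the lack of invariance of LLM graph reasoners to symmetries in graph representations by decomposing serialization factors and evaluating the effect of fine-tuning. Larger models are more robust; fine-tuning reduces sensitivity to node relabeling but can increase sensitivity to structure and format, and does not guarantee generalization.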

Comments ICML 2026 Workshop on Graph Foundation Models

Abstract

While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well as generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
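
The symmetries named in this abstract can be illustrated with a short sketch: the same graph yields different text under node relabeling or edge reordering, while graph-level answers such as the degree sequence stay fixed. The edge-list syntax, the example graph, and all function names below are hypothetical choices for illustration, not the paper's benchmark code.

```python
import random
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[int, int]


def serialize(edges: List[Edge]) -> str:
    """Render an edge list as text (one hypothetical syntax choice)."""
    return ", ".join(f"({u}, {v})" for u, v in edges)


def relabel(edges: List[Edge], perm: Dict[int, int]) -> List[Edge]:
    """Apply a node permutation; the result is an isomorphic graph."""
    return [(perm[u], perm[v]) for u, v in edges]


def reorder(edges: List[Edge], seed: int) -> List[Edge]:
    """Shuffle the edge order; the edge set itself is unchanged."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def canonical(edges: List[Edge]) -> FrozenSet[FrozenSet[int]]:
    """Order-independent view of an undirected edge set."""
    return frozenset(frozenset(e) for e in edges)


def degree_sequence(edges: List[Edge]) -> List[int]:
    """Sorted node degrees, an answer that must not depend on serialization."""
    deg: Dict[int, int] = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sorted(deg.values())


triangle_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3)]
relabeled = relabel(triangle_with_tail, {0: 3, 1: 2, 2: 1, 3: 0})
reordered = reorder(triangle_with_tail, seed=0)
```

A robustness check in this spirit would prompt a model with `serialize(...)` of each variant and compare its answers. An invariant reasoner should return the same degree sequence for all three.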

2511.05349 2026-06-10 cs.SD Updated version

Passive Acoustic-based Composite Indices for Reef Health Monitoring in Noisy Tropical waters

Hari Vishnu, Yuen Min Too, Mandar Chitre, Danwei Huang, Teong Beng Koay, Sudhanshi S. Jain

Affiliations: University of Technology, Sydney; Nanyang Technological University; National Institute of Oceanography and Environmental Physics; Institute of Marine and Coastal Sciences, University of Connecticut; Indian Institute of Technology, Bombay

AI summary: Proposes a convolutional neural network denoiser for low-frequency noise and combines acoustic indicators such as sound pressure level, acoustic complexity index, and shrimp snap rate to achieve reef health monitoring consistent with diver-based assessments.

Abstract

Passive acoustic monitoring offers the potential to enable long-term, spatially extensive assessments of coral reefs. To explore this approach, we deployed underwater acoustic recorders at ten coral reef sites around Singapore waters over two years. To mitigate the persistent anthropogenic and current-induced noise masking the low-frequency reef soundscape, we trained a convolutional neural network denoiser. Analysis of the acoustic data reveals distinct morning and evening choruses. Though the correlation with environmental variates was obscured in the low-frequency part of the noisy recordings, the denoised data showed correlations of acoustic activity indices such as sound pressure level and acoustic complexity index with diver-based assessments of reef health such as live coral richness and cover, and algal cover. Furthermore, the shrimp snap rate, computed from the high-frequency acoustic band, is robustly correlated with the reef parameters, both temporally and spatially. This study demonstrates that passive acoustics holds valuable information that can help with reef monitoring, provided the data is effectively denoised and interpreted. This methodology can be extended to other marine environments where acoustic monitoring is hindered by persistent noise.

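2511.02603 2026-06-10 cs.CL Updated version

CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik

Affiliations: University of Massachusetts Amherst

AI summary: Proposes CGES, a Bayesian framework that adaptively stops sampling to reduce the number of self-consistency inference calls, cutting calls by 58% on average across five reasoning benchmarks while staying within 0.4 percentage points of self-consistency accuracy.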

Comments Extended version. A preliminary version was accepted at the Efficient Reasoning Workshop @ NeurIPS 2025. Code: https://github.com/EhsanAghazadeh/cges

Abstract

Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency (Wang et al., 2023) strategy requires a fixed number of calls and fails when the correct answer is infrequent. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers and adaptively halts sampling once one answer accumulates enough posterior mass. We prove guarantees in both an ideal calibrated regime and a realistic noisy-confidence regime under a directional drift condition. Averaged over five reasoning benchmarks, CGES reduces the number of calls by 58% (from 16.0 to 6.7) while staying within 0.4 percentage points of the accuracy of self-consistency.
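
The stopping rule described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released CGES code. The fixed candidate set, the uniform prior, the likelihood model (the reported confidence for the sampled answer, with the remainder split evenly over the other candidates), the 0.95 threshold, and the name `early_stop_vote` are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


def early_stop_vote(
    sample: Callable[[], Tuple[str, float]],
    candidates: List[str],
    threshold: float = 0.95,
    max_calls: int = 16,
) -> Tuple[str, int]:
    """Query `sample` until one candidate holds `threshold` posterior mass.

    `sample` returns (answer, confidence); `candidates` needs at least two entries.
    """
    # Uniform prior over a fixed, known candidate set (an assumption).
    posterior: Dict[str, float] = {c: 1.0 / len(candidates) for c in candidates}
    calls = 0
    while calls < max_calls:
        answer, conf = sample()
        calls += 1
        # Clamp confidence away from 0 and 1 so no candidate is zeroed out.
        conf = min(max(conf, 1e-6), 1.0 - 1e-6)
        other = (1.0 - conf) / (len(candidates) - 1)
        # Bayes update: likelihood `conf` for the reported answer,
        # remaining mass split evenly over the other candidates.
        for c in candidates:
            posterior[c] *= conf if c == answer else other
        total = sum(posterior.values())
        for c in candidates:
            posterior[c] /= total
        best = max(posterior, key=posterior.get)
        if posterior[best] >= threshold:
            return best, calls
    # Budget exhausted: fall back to the highest-posterior answer.
    return max(posterior, key=posterior.get), calls


# Two agreeing samples at confidence 0.9 are enough to stop early.
result = early_stop_vote(lambda: ("A", 0.9), ["A", "B", "C"])
```

With three candidates, two agreeing samples at confidence 0.9 push the posterior for "A" from 0.9 to about 0.994, so `result` is `("A", 2)` rather than spending the full 16-call budget.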

2510.09801 2026-06-10 cs.AI Updated version

How can we assess human-agent interactions? Case studies in software agent design

Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig

Affiliations: University of California, Berkeley

AI summary: Proposes the PULSE framework, which evaluates human-agent interaction by combining user feedback with model predictions. A study with 15k users shows that it narrows confidence intervals by 40% and reveals gaps between benchmark results and real-world outcomes.

Comments ICML 2026

Abstract

While benchmarks measure the accuracy of LLM-powered agents, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we take two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs, which comprises collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy PULSE in software engineering -- one of the highest-impact, real-world domains for LLM agents -- via a large-scale web platform built around the open-source agent OpenHands. Across 15k users, we evaluate how three agent design decisions impact developer satisfaction rates. We also show how PULSE can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our framework PULSE provides guidance for future evaluations, and our findings identify opportunities for better software agent designs.

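2511.01927 2026-06-10 cs.LG cs.AI cs.NA math.NA Updated version

Learning-Guided Integration Contours Construction for Fast Large-Scale Generalized Eigensolvers

Yeqiu Chen, Ziyan Liu, Hong Wang, Lei Liu

Affiliations: University of Science

AI summary: Proposes Deepcontour, a hybrid framework that combines a deep learning spectral predictor with kernel density estimation to automatically construct optimized integration contours, accelerating large-scale generalized eigenvalue solvers by up to 5.63x while preserving numerical accuracy.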

Abstract

Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours -- improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we propose Deepcontour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation (KDE) for principled contour design. Specifically, Deepcontour utilizes its specialized Eigen-Neural-Operator (ENO) to provide rapid spectral distribution priors, driving a KDE module to automatically construct the optimized integration contours, which guide the CI solver to efficiently find the desired eigenvalues. Deepcontour achieves up to a 5.63x speedup across diverse scientific datasets while maintaining strict numerical rigor. By merging the predictive power of deep learning with the numerical rigor of classical solvers, this work establishes an efficient and robust paradigm for solving large-scale GEPs.

2503.19158 2026-06-10 cs.LG q-bio.QM Updated version

Integrating Biological-Informed Recurrent Neural Networks for Glucose-Insulin Dynamics Modeling

Stefano De Carli, Nicola Licini, Davide Previtali, Fabio Previdi, Antonio Ferramosca

Affiliations: Department of Management, Information and Production Engineering, University of Bergamo

AI summary: Proposes a Biological-Informed Recurrent Neural Network framework for more accurate modeling of glucose-insulin dynamics, aiming to improve personalized glucose regulation in artificial pancreas systems.

Comments Accepted for publication in the proceedings of the Engineering Diabetes Technologies (EDT 2025). 7 pages, 2 figures and 1 table

Journal ref IFAC-PapersOnLine, 59(2), 2025, pp. 91-96

Abstract

Type 1 Diabetes (T1D) management is a complex task due to many variability factors. Artificial Pancreas (AP) systems have alleviated patient burden by automating insulin delivery through advanced control algorithms. However, the effectiveness of these systems depends on accurate modeling of glucose-insulin dynamics, which traditional mathematical models often fail to capture because they cannot adapt to patient-specific variations. This study introduces a Biological-Informed Recurrent Neural Network (BIRNN) framework to address these limitations. The BIRNN leverages a Gated Recurrent Unit (GRU) architecture augmented with physics-informed loss functions that embed physiological constraints, ensuring a balance between predictive accuracy and consistency with biological principles. The framework is validated using the commercial UVA/Padova simulator, outperforming traditional linear models in glucose prediction accuracy and reconstruction of unmeasured states, even under circadian variations in insulin sensitivity. The results demonstrate the potential of BIRNN for personalized glucose regulation and future adaptive control strategies in AP systems.

2509.05913 2026-06-10 cs.CV Updated version

A fine-grained attention and geometric correspondence model for musculoskeletal risk classification in athletes using multimodal visual and skeletal features

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Tamanna Shermin, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam

Affiliations: Department of Computer Science and Engineering, United International University; Department of Data Science and Artificial Intelligence, Monash University; Faculty of Science and Technology, Charles Darwin University; Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory, Dhaka

AI summary: Proposes ViSK-GAT, a multimodal framework that fuses image and skeletal-coordinate features through a fine-grained attention module and a geometric correspondence module to classify athletes' musculoskeletal risk into eight levels, with key metrics above 93%.

Comments Published in Computers and Electrical Engineering

Journal ref Computers and Electrical Engineering, Vol. 138, 111281, 2026

Abstract

Musculoskeletal disorders pose significant risks to athletes, and early risk assessment is essential for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research introduces ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework that classifies musculoskeletal risk using both visual and skeletal coordinate-based features. A custom multimodal dataset (MusDis-Sports) was created by combining images and skeletal coordinates, with each sample labeled into eight risk categories based on the Rapid Entire Body Assessment (REBA) system. ViSK-GAT integrates two innovative modules: the Fine-Grained Attention Module (FGAM), which refines intra-modal features through self-attention before fusion, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal alignment between image features and coordinates. The model achieved robust performance, with all key metrics exceeding 93%. Probability distribution error metrics also showed a low Root Mean Squared Error (RMSE) of 0.1205 and a Mean Absolute Error (MAE) of 0.0156. ViSK-GAT consistently outperformed state-of-the-art (SOTA) deep learning backbones and showed its potential to advance artificial intelligence-driven musculoskeletal risk assessment and enable timely interventions in sports.

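2502.01272 2026-06-10 cs.LG Updated version

Boosting Graph Robustness Against Backdoor Attacks: An Over-Similarity Perspective

Chang Liu, Hai Huang, Yujie Xing, Xingquan Zuo

Affiliations: Beijing University of Posts and Telecommunications

AI summary: To address the difficulty of telling triggers apart from clean nodes in graph backdoor attacks, proposes SimGuard, a defense based on over-similarity detection that uses contrastive learning to train a detector separating triggers from clean nodes, defending against multiple backdoor attacks while preserving clean-node performance.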

Comments After discussions with one of the co-authors, it was decided that this version should not be made public at this time. To respect the co-author's perspective and ensure alignment among all authors, I am requesting the withdrawal of this article

Abstract

Graph Neural Networks (GNNs) have achieved notable success in tasks on social and transportation networks. However, recent studies have highlighted the vulnerability of GNNs to backdoor attacks, raising significant concerns about their reliability in real-world applications. Despite initial efforts to defend against specific graph backdoor attacks, existing defense methods face two main challenges: either they cannot clearly distinguish triggers from clean nodes, resulting in the removal of many clean nodes, or they fail to eliminate the impact of triggers, making it hard to restore the target nodes to their pre-attack state. Through empirical analysis of various existing graph backdoor attacks, we observe that the triggers generated by these methods exhibit over-similarity in both features and structure. Based on this observation, we propose a novel graph backdoor defense method, SimGuard. We first utilize a similarity-based metric to detect triggers and then employ contrastive learning to train a backdoor detector that generates embeddings capable of separating triggers from clean nodes, thereby improving detection efficiency. Extensive experiments conducted on real-world datasets demonstrate that our proposed method effectively defends against various graph backdoor attacks while preserving performance on clean nodes. The code will be released upon acceptance.

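2510.12071 2026-06-10 cs.LG Updated version

Influence Dynamics and Stagewise Data Attribution

Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland

Affiliations: University College London; Independent; University of Melbourne; Timaeus

AI summary: Addresses the changing influence of samples during neural network training by proposing a stagewise data attribution framework grounded in singular learning theory. Predicts non-monotonic influence changes (sign flips, sharp peaks) and verifies that they correspond to learning stages in a toy model and in language models.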

Comments 28 pages, 15 figures

Abstract

Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.

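2510.08622 2026-06-10 cs.CL cs.SE Updated version

Automated Alignment between Elicitation Interviews and Requirements

Francesco Dente, Fabiano Dalpiaz, Paolo Papotti

Affiliations: University of Bologna

AI summary: Proposes the task of automatically aligning interview transcripts with user-story requirements, defines two metrics (faithfulness and coverage), and automates their evaluation with large language models and embedding models, reaching 0.86 macro-F1 across four datasets.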

Comments 8 pages

Abstract

Software requirements are derived from a variety of elicitation techniques, many of which have a conversational nature, like interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a challenging manual task. In this paper, we formalize the task of aligning the transcript of an interview with a collection of requirements represented as user stories. We propose two heuristic metrics for alignment, called (i) requirements faithfulness: the proportion of stories supported by the transcript, and (ii) interview coverage: the proportion of transcript supported by at least one story. Then, we run experiments with large language models and embedding models that assess the ability of evaluating these metrics automatically. Experiments over four datasets show that an LLM-based solution achieves 0.86 macro-F1 on manually labeled chunk-story pairs. We also show how embedding models can be used as blockers to make the approach more scalable. This work paves the way for more research on linking conversational artifacts with requirements. The formal framework and the automated matching techniques are basic components that can be used for emerging tasks such as tracing requirements to interviews and generating requirements from conversations.
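
The two metrics defined in this abstract reduce to simple proportions once each chunk-story pair has a binary "supported" label. The sketch below assumes the transcript is already split into indexed chunks and that the supported pairs come from some upstream judge such as an LLM. The function name and input layout are illustrative assumptions, not the paper's interface.

```python
from typing import Iterable, Set, Tuple


def alignment_metrics(
    num_chunks: int,
    num_stories: int,
    supported_pairs: Iterable[Tuple[int, int]],
) -> Tuple[float, float]:
    """Return (requirements faithfulness, interview coverage).

    `supported_pairs` holds (chunk_index, story_index) pairs judged as aligned.
    """
    pairs = list(supported_pairs)
    stories_hit: Set[int] = {s for _, s in pairs}  # stories backed by some chunk
    chunks_hit: Set[int] = {c for c, _ in pairs}   # chunks reflected in some story
    faithfulness = len(stories_hit) / num_stories if num_stories else 0.0
    coverage = len(chunks_hit) / num_chunks if num_chunks else 0.0
    return faithfulness, coverage


# 4 transcript chunks, 3 user stories, 3 aligned pairs.
faithfulness, coverage = alignment_metrics(4, 3, [(0, 0), (1, 0), (2, 2)])
```

In this example two of the three stories are supported, giving a faithfulness of 2/3, and three of the four chunks are covered, giving a coverage of 0.75.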

2510.07061 2026-06-10 cs.CL Updated version

Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto

Affiliations: Sharif University of Technology; Vellore Institute of Technology; IIT Kharagpur; Mohamed bin Zayed University of Artificial Intelligence

AI summary: To address the under-evaluation of Indian languages, proposes the ITEM benchmark, which systematically assesses how well 29 automatic metrics align with human judgments. Finds that LLM-based evaluators perform best and reports key findings on outlier effects, task differences, and robustness to perturbations.

Comments 18 pages, 14 figures

Abstract

While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 29 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.

2510.04195 2026-06-10 cs.AI Updated version

Constructing coherent spatial memory in LLM agents through graph rectification

Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, Liqiu Meng

Affiliations: Chair of Cartography and Visual Analytics

AI summary: Proposes LLM-MapRepair, a framework that uses version control and an edge impact score to detect and correct structural inconsistencies in incrementally built navigation graphs, substantially improving node and edge recall on several benchmarks.

详情
AI中文摘要

给定通过全局遍历导航指令的地图描述,LLM通常能够推断隐式空间布局并通过提供最短路径来回答用户查询。然而,随着环境变大,这种依赖于上下文的查询变得不可行,这促使需要增量地图构建,即从逐步观察中构建完整的拓扑图。我们提出LLM-MapRepair,一个用于LLM驱动的地图构建和修复的框架,旨在检测、定位和修正增量构建的导航图中的结构不一致性。我们的贡献包括:用于图构建的版本控制机制、用于修复优先级的边影响评分,以及为LLM驱动的地图构建和修复量身定制的MANGO基准的清理变体。我们在四个评估设置上评估该框架:合成逐组件消融(gpt-4.1,每个单元n=20个种子)、跨供应商扫描(覆盖OpenAI、Anthropic和Google的七个LLM,在合成和TextWorld程序生成的文本冒险游戏上)、修复阶段评估(在所有42个清理后的MANGO游戏上,具有非零剩余冲突,共534个冲突;三个供应商×三种模式加上两个非LLM参考),以及在《红楼梦》第16-17章上的端到端自然文本部署。在DRC部署中,LLM-MapRepair使用GPT-4.1实现了94.3%的节点召回率(比直接LLM映射高8.6个百分点)和88.2%的边召回率(高55.8个百分点);召回率的提升伴随着预测节点和边数量约为真实值的4倍(表4),这反映了我们在局限性中讨论的离散化驱动的过度生成权衡。

English abstract

Given a map description through global traversal navigation instructions, an LLM can often infer the implicit spatial layout and answer user queries by providing shortest paths. However, such context-dependent querying becomes infeasible as environments grow larger, motivating the need for incremental map construction that builds a complete topological graph from stepwise observations. We propose LLM-MapRepair, a framework for LLM-driven map construction and repair, designed to detect, localize, and correct structural inconsistencies in incrementally constructed navigation graphs. Our contributions include a Version Control mechanism for graph construction, an Edge Impact Score for repair prioritization, and a cleaned variant of the MANGO benchmark tailored for LLM-driven map construction and repair. We evaluate the framework in four settings: a synthetic per-component ablation (gpt-4.1, n=20 seeds per cell), a cross-vendor sweep over seven LLMs from OpenAI, Anthropic, and Google on both synthetic and TextWorld procedurally-generated text-adventure games, a repair-stage evaluation on all 42 cleaned-MANGO games with non-zero residual conflicts (534 conflicts; three vendors x three modes plus two non-LLM references), and an end-to-end natural-text deployment on Chapters 16-17 of Dream of the Red Chamber (DRC). On the DRC deployment, LLM-MapRepair achieves 94.3% node recall (+8.6 pp over direct LLM mapping) and 88.2% edge recall (+55.8 pp), using GPT-4.1; the recall improvements come with predicted node and edge counts that are roughly 4x the ground-truth counts (Table 4), reflecting the discretization-driven over-generation trade-off we discuss in the Limitations.
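To make "structural inconsistency" concrete, the sketch below flags one simple kind of conflict in a navigation graph: the same direction from the same location leading to two different places. The abstract does not define the conflict types LLM-MapRepair handles, so the `(source, direction, destination)` edge format and the helper `find_direction_conflicts` are hypothetical illustrations, not the paper's method.

```python
from collections import defaultdict


def find_direction_conflicts(edges):
    """Return every (source, direction) pair that leads to more than one destination.

    `edges` is an iterable of (source, direction, destination) triples.
    This edge format is an assumption made for illustration only.
    """
    targets = defaultdict(set)
    for source, direction, destination in edges:
        targets[(source, direction)].add(destination)
    # Keep only the pairs whose destination set is ambiguous.
    return {
        key: sorted(destinations)
        for key, destinations in targets.items()
        if len(destinations) > 1
    }


edges = [
    ("hall", "north", "kitchen"),
    ("kitchen", "south", "hall"),
    ("hall", "north", "garden"),  # conflicts with the first edge
]
conflicts = find_direction_conflicts(edges)
```

A repair stage would then have to decide which of the conflicting edges to keep, which is where a prioritization signal such as the Edge Impact Score mentioned in the abstract would come in.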

2507.14725 2026-06-10 cs.LG cs.AI Version update

GRID: Scaling Task-Agnostic Inference in Continual Prompt Tuning

GRID:持续提示调优中任务无关推理的规模化

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

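Affiliations * State University of New York at Buffalo; Department of Computer Science and Engineering; Institute for Artificial Intelligence and Data Science

AI summary Proposes GRID, a framework that uses output-space-aware decoding and gradient-guided prompt selection to address the performance degradation and scalability problems of task-agnostic inference in continual learning, improving backward transfer and reducing prompt memory on long-sequence and negative-transfer benchmarks.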

Details
AI-generated Chinese abstract

基于提示的持续学习提供了一种参数高效的方式,使大型语言模型能够适应任务序列。然而,现有方法通常依赖任务感知推理,并维护不断扩展的任务特定提示集,导致(1)当推理时任务标识符不可用于提示选择时,早期任务性能严重下降;(2)随着任务序列增长,可扩展性受限。我们提出GRID,一个统一的框架来解决这些挑战。GRID包含一个输出空间感知解码机制,通过利用代表性输入和自动标签语义归一化来增强后向迁移,以及一个梯度引导的提示选择策略,将信息量较少的提示压缩为单个聚合表示,以实现可扩展、内存高效的持续学习。在长序列和负迁移基准上的大量实验表明,GRID改善了后向迁移,实现了有竞争力的前向迁移,并显著减少了编码器-解码器和仅解码器架构(包括T5、Qwen和LLaMA)中的提示内存。源代码可从 https://github.com/AnushkaTi/GRID 获取。

English abstract

Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, existing methods often rely on task-aware inference and maintain an expanding set of task-specific prompts, leading to (1) severe performance degradation on earlier tasks when task identifiers are unavailable for prompt selection at inference time, and (2) limited scalability as the task sequence grows. We propose GRID, a unified framework designed to address these challenges. GRID incorporates an output-space-aware decoding mechanism that enhances backward transfer by leveraging representative inputs and automatic label semantic normalization, alongside a gradient-guided prompt selection strategy that compresses less informative prompts into a single aggregated representation for scalable, memory-efficient continual learning. Extensive experiments on long-sequence and negative-transfer benchmarks show that GRID improves backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory across encoder-decoder and decoder-only architectures, including T5, Qwen, and LLaMA. Source code is available at https://github.com/AnushkaTi/GRID.

2509.25760 2026-06-10 cs.CL cs.AI cs.LG Version update

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

TruthRL: 通过强化学习激励诚实的LLM

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

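Affiliations * University of California, Berkeley

AI summary Proposes TruthRL, a framework that uses GRPO with a ternary reward to directly optimize LLM truthfulness, reducing hallucinations and allowing abstention under uncertainty, with significant truthfulness gains on knowledge-intensive benchmarks.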

Comments ICML 2026. Code: https://github.com/facebookresearch/TruthRL

Details
AI-generated Chinese abstract

虽然大型语言模型(LLM)在事实性问题回答上表现出色,但它们仍然容易产生幻觉和不真实的回答,特别是当任务需要其参数知识之外的信息时。事实上,诚实性需要的不仅仅是准确性——模型还必须识别不确定性,并在不确定时弃权以避免幻觉。这对现有方法提出了根本性挑战:优化准确性的方法往往会放大幻觉,而鼓励弃权的方法可能变得过于保守,牺牲正确答案。两种极端最终都损害了诚实性。在这项工作中,我们提出了TruthRL,一个通用的强化学习(RL)框架,直接优化LLM的诚实性。具体来说,我们使用GRPO实现TruthRL,并采用一个简单而有效的三值奖励,区分正确答案、幻觉和弃权。它激励模型不仅通过提供正确回答来减少幻觉,还通过在不确定时启用弃权来提高诚实性。在四个知识密集型基准上的大量实验表明,TruthRL显著减少了幻觉(例如,43.5% → 19.4%)并提高了诚实性(例如,5.3% → 37.2%),在各种骨干模型上均有一致的提升。分析表明,TruthRL的改进源于LLM识别其知识边界的能力增强,从而避免了像基线那样过于保守。

English abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% → 19.4%) and improves truthfulness (e.g., 5.3% → 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from the enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.
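The ternary reward described above can be sketched in a few lines. The abstract says only that the reward distinguishes correct answers, hallucinations, and abstentions, so the values +1, 0, and -1, the abstention phrases, and the substring matching below are all assumptions made for illustration. The group-relative standardization is the usual GRPO-style advantage, shown here without the rest of the training loop.

```python
from statistics import mean, pstdev

# Assumed reward values: the abstract states the reward is ternary but not its values.
REWARD_CORRECT = 1.0
REWARD_ABSTAIN = 0.0
REWARD_HALLUCINATION = -1.0

# Hypothetical abstention markers, used only for this sketch.
ABSTAIN_PHRASES = ("i don't know", "i do not know")


def ternary_reward(response: str, gold: str) -> float:
    """Score one response as a correct answer, an abstention, or a hallucination."""
    text = response.strip().lower()
    if any(phrase in text for phrase in ABSTAIN_PHRASES):
        return REWARD_ABSTAIN
    # Substring matching is a simple stand-in for a real answer checker.
    if gold.strip().lower() in text:
        return REWARD_CORRECT
    return REWARD_HALLUCINATION


def group_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Every response got the same reward, so none is preferred over the others.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Under these assumed values, an abstention scores above a wrong answer but below a correct one, which is the ordering the abstract's argument relies on: the model is not pushed to guess, and it is not rewarded for refusing when it could have answered correctly.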

2509.25017 2026-06-10 cs.LG cs.CV Version update

Uncertainty-Aware Deep Learning for Wildfire Danger Forecasting

不确定性感知的深度学习用于野火危险预测

Spyros Kondylatos, Nikolas Papadopoulos, Gustau Camps-Valls, Ioannis Papoutsis

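Affiliations * Aix-Marseille University; University of Cambridge; University of Malaga; University of Crete

AI summary Proposes an uncertainty-aware deep learning framework that jointly captures epistemic and aleatoric uncertainty, improving the accuracy and reliability of short-term wildfire danger forecasting: the F1 score rises by 2.3% and the expected calibration error falls by 2.1%.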

Details
AI-generated Chinese abstract

野火是最严重的自然灾害之一,对人类和自然生态系统构成重大威胁。日益增长的野火风险增加了对不仅准确而且可靠的预测模型的需求。深度学习在预测野火危险方面显示出潜力;然而,其采用受到对其预测可靠性的担忧的阻碍,部分源于缺乏不确定性量化。为应对这一挑战,我们提出了一个不确定性感知的深度学习框架,该框架联合捕获认知(模型)和偶然(数据)不确定性,以增强短期野火危险预测。在次日预测中,与确定性基线相比,我们表现最佳的模型将F1分数提高了2.3%,并将预期校准误差降低了2.1%,从而提升了预测技能和校准能力。我们的实验证实了不确定性估计的可靠性,并展示了它们在决策支持中的实际效用,包括识别拒绝低置信度预测的不确定性阈值,以及生成伴随不确定性层的良好校准的野火危险图。将预测范围延长至十天,我们观察到偶然不确定性随时间增加,表明环境条件的更大变异性,而认知不确定性保持稳定。最后,我们表明,尽管两种不确定性类型在低不确定性情况下可能是冗余的,但在更具挑战性的条件下它们提供互补的见解,强调了联合建模对稳健野火危险预测的价值。总之,我们的方法显著提高了野火危险预测的准确性和可靠性,推动了可信赖的野火深度学习系统的发展。

English abstract

Wildfires are among the most severe natural hazards, posing a significant threat to both humans and natural ecosystems. The growing risk of wildfires increases the demand for forecasting models that are not only accurate but also reliable. Deep Learning (DL) has shown promise in predicting wildfire danger; however, its adoption is hindered by concerns over the reliability of its predictions, some of which stem from the lack of uncertainty quantification. To address this challenge, we present an uncertainty-aware DL framework that jointly captures epistemic (model) and aleatoric (data) uncertainty to enhance short-term wildfire danger forecasting. In next-day forecasting, our best-performing model improves the F1 Score by 2.3% and reduces the Expected Calibration Error by 2.1% compared to a deterministic baseline, enhancing both predictive skill and calibration. Our experiments confirm the reliability of the uncertainty estimates and illustrate their practical utility for decision support, including the identification of uncertainty thresholds for rejecting low-confidence predictions and the generation of well-calibrated wildfire danger maps with accompanying uncertainty layers. Extending the forecast horizon up to ten days, we observe that aleatoric uncertainty increases with time, reflecting greater variability in environmental conditions, while epistemic uncertainty remains stable. Finally, we show that although the two uncertainty types may be redundant in low-uncertainty cases, they provide complementary insights under more challenging conditions, underscoring the value of their joint modeling for robust wildfire danger prediction. In summary, our approach significantly improves the accuracy and reliability of wildfire danger forecasting, advancing the development of trustworthy wildfire DL systems.
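For readers unfamiliar with the calibration metric quoted above, the sketch below computes the Expected Calibration Error in its common equal-width-bin form. The bin count of 10 and the function name are assumptions; the abstract does not state how the authors bin their predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| over bins.

    `confidences` are predicted probabilities in [0, 1] for the predicted class,
    and `correct` holds 1 (or True) where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

For example, four predictions made at 95% confidence of which three are right give a gap of 0.95 - 0.75 = 0.20, so a lower ECE means stated confidence tracks observed accuracy more closely.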

2507.13595 2026-06-10 cs.CV Version update

NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision

NoiseSDF2NoiseSDF: 从含噪监督中学习干净的神经场

Tengkai Wang, Weihao Li, Ruikai Cui, Shi Qiu, Nick Barnes

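Affiliations * University of Cambridge

AI summary Proposes NoiseSDF2NoiseSDF, which learns clean neural SDFs from noisy point clouds by minimizing the MSE loss between noisy SDF representations, achieving implicit denoising and surface refinement.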

Comments 16 pages, 7 figures

Details
AI-generated Chinese abstract

从点云重建准确的隐式表面表示仍然是一项具有挑战性的任务,特别是当数据使用低质量扫描设备捕获时。这些点云通常包含大量噪声,导致表面重建不准确。受2D图像中Noise2Noise范式的启发,我们引入了NoiseSDF2NoiseSDF,一种旨在将此概念扩展到3D神经场的新方法。我们的方法通过最小化含噪SDF表示之间的MSE损失,从含噪点云中通过含噪监督学习干净的神经SDF,使网络能够隐式去噪并细化表面估计。我们在ShapeNet、ABC、Famous和Real数据集等基准上评估了NoiseSDF2NoiseSDF的有效性。实验结果表明,我们的框架显著提高了从含噪输入重建的表面质量。

English abstract

Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.

2505.14608 2026-06-10 cs.CL cs.AI cs.LG Version update

Attacks on Machine-Text Detectors Retain Stylistic Fingerprints

对机器文本检测器的攻击保留风格指纹

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

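Affiliations * GitHub; University of California, Berkeley

AI summary Studies the limits of adversarial attacks on machine-text detectors and proposes a paraphrasing method that jointly optimizes undetectability and adherence to a specific human style; finds that single-document detection is unreliable and that multi-document analysis is needed.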

Details
AI-generated Chinese abstract

尽管机器文本检测器的开发取得了显著进展,但机器文本容易被操纵以逃避检测,这导致有人认为该问题本质上是难以解决的。在这项工作中,我们研究了这种逃避策略的局限性。我们证明,尽管当前的攻击(从提示工程到检测器引导的优化)可以有效降低标准检测器的性能,但它们无法抹去机器文本底层的风格“指纹”。我们表明,利用风格特征空间的少样本检测器对这些逃避尝试具有鲁棒性,即使对于明确调整以逃避检测的模型生成的样本也能可靠地检测。这引发了一个问题:风格是否代表了对机器检测攻击的通用防御?我们通过引入一种新颖的 paraphrasing 方法来证明答案是“不”,该方法同时优化不可检测性和对特定人类风格的遵循。我们表明,与先前方法不同,这种攻击有效逃避了所有考虑的检测器,包括那些利用写作风格的检测器。然而,我们发现这种逃避并非绝对:随着可供分析的文档数量增加,人类和机器分布再次变得可区分。总体而言,我们的发现表明,可靠的机器文本检测需要从单文档分析转向多文档分析。

English abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine-text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against machine-detection attacks? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.

2509.04027 2026-06-10 cs.AI cs.CL Version update

Why Does Reasoning Length Converge? Unveiling the Underfitting-Overfitting Trade-off in Chain-of-Thought

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Zeyu Gan, Hao Yi, Yong Liu

Affiliations * Zeyu Gan, Yi Hao, Yong Liu

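AI summary Proposes CoT-Space, a theoretical framework that recasts reasoning under reinforcement learning from a discrete token-prediction task into an optimization process in a continuous, reasoning-level semantic space, showing that convergence to an optimal CoT length in test-time scaling is a natural consequence of the fundamental trade-off between underfitting and overfitting.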

Comments Preprint Edition

Details
AI-generated Chinese abstract

测试时扩展,主要通过强化学习(RL)中的多步链式推理(CoT)体现,已成为增强大型语言模型(LLMs)推理能力的关键范式。然而,仍存在显著的理论空白:传统token级分析无法捕捉推理层面扩展的宏观动态。为此,我们引入CoT-Space,一种新的理论框架,将推理过程从离散的token预测任务转换为连续的推理层面语义空间中的优化过程。通过从噪声和风险视角建模推理轨迹,并复兴经典学习理论中的基础原理,我们证明观察到的收敛到最优CoT长度是欠拟合与过拟合基本权衡的自然结果。我们进一步利用RL作为工具,在实验中激发并验证这些结果。我们的发现为通过RL实现内部测试时扩展提供了机制解释,为现代LLMs中优化推理轨迹提供了系统性的理论基础。

English abstract

Test-time scaling, primarily manifested through multi-step Chain-of-Thought (CoT) reasoning via Reinforcement Learning (RL), has emerged as a pivotal paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists: traditional token-level analysis fails to capture the macroscopic dynamics of reasoning-level scaling. To address this, we introduce CoT-Space, a novel theoretical framework that recasts the reasoning process from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By modeling the reasoning trajectory from both noise and risk perspectives and revitalizing foundational principles from classical learning theory, we demonstrate that the observed convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. We further utilize RL as a tool to elicit and verify these results in our experiments. Our findings provide a mechanistic explanation for the internal test-time scaling via RL, offering a principled theoretical foundation to optimize reasoning trajectories in modern LLMs.

2509.19936 2026-06-10 cs.CV Version update

CapStARE: Capsule-based Sequential Architecture for Robust and Efficient Gaze Estimation

CapStARE: 基于胶囊的序列架构实现鲁棒高效的目光估计

Miren Samaniego, Igor Rodriguez, Elena Lazkano

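Affiliations * University of the Basque Country

AI summary Proposes CapStARE, which combines a frozen ConvNeXt backbone, attention-routed capsules, and dual GRU decoders to achieve real-time, high-accuracy gaze estimation on ETH-XGaze and other datasets, balancing spatial robustness and computational efficiency.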

Comments Preprint for Pattern Recognition Journal

Details
AI-generated Chinese abstract

人类目光估计对于人机交互、社交机器人和辅助系统等应用至关重要。然而,在非约束环境中实现准确、可解释且实时的性能仍然具有挑战性。现有的基于外观的方法通常在空间鲁棒性、计算效率和上下文信息的有效利用之间面临权衡。为了解决这一问题,我们引入了CapStARE,一种基于胶囊的架构,它结合了用于高效特征提取的冻结ConvNeXt骨干网络、用于结构化面部推理的基于注意力路由的胶囊形成,以及用于短时域观测窗口上轻量级序列建模的双GRU解码器。这种设计保留了可解释的部分-整体面部关系,同时通过局部上下文一致性提高了预测稳定性。实验结果表明,该方法在ETH-XGaze(3.36)和MPIIFaceGaze(2.65)上表现强劲,同时在Gaze360(9.06)上也具有竞争力的泛化能力,且所有测试均实现实时推理(<10毫秒)。这些发现表明,所提出的方法为现实交互环境中基于外观的目光估计提供了一个实用且鲁棒的框架。相关代码和实验结果公开于:https://github.com/toukapy/capsStare

English abstract

Human gaze estimation is essential for applications such as human-computer interaction, social robotics, and assistive systems. However, achieving accurate, interpretable, and real-time performance in unconstrained environments remains challenging. Existing appearance-based methods often face trade-offs between spatial robustness, computational efficiency, and effective use of contextual information. To address this, we introduce CapStARE, a capsule-based architecture that combines a frozen ConvNeXt backbone for efficient feature extraction, capsule formation with attention-based routing for structured facial reasoning, and dual GRU decoders for lightweight sequential modeling over short-horizon observation windows. This design preserves interpretable part-whole facial relationships while improving prediction stability through local contextual consistency. Experimental results demonstrate strong performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65), while also generalizing competitively on Gaze360 (9.06), all with real-time inference (<10 ms). These findings suggest that the proposed method provides a practical and robust framework for appearance-based gaze estimation in real-world interactive environments. The related code and experimental results are publicly available at: https://github.com/toukapy/capsStare

2509.16518 2026-06-10 cs.CV cs.AR Version update

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

FG-Attn:在视频扩散模型中利用细粒度稀疏注意力

Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar

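Affiliations * University of California, Berkeley

AI summary Targets the high computational cost of attention layers in video diffusion models and proposes FG-Attn, a low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of MxN tiles, achieving speedups of up to 2.45x.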

Details
AI-generated Chinese abstract

使用扩散变压器进行媒体生成可能需要评估极长序列上的注意力,其中注意力层占生成延迟的大部分。利用注意力图中的稀疏性为降低这一成本提供了有前景的机会。在这项工作中,我们展示了扩散变压器中的注意力图在视频生成模型中表现出显著的细粒度稀疏性。然而,现有的稀疏注意力方法过于粗粒度,留下了大量未处理的冗余计算,或者在更细粒度上产生高开销。我们提出FG-Attn,一种新颖的低开销细粒度稀疏注意力机制,它在MxN块的粒度上跳过分数计算,其中N>=1且M>=16,每个块是M个查询和N个键之间查询-键点积的结果。FG-Attn解决了GPU上稀疏注意力内核中硬件利用率不足的关键挑战,同时避免了不规则内存访问和冗余操作的开销。FG-Attn可以完全取代现有的稀疏注意力方法,并将块稀疏注意力方法扩展到现代GPU上的更细粒度。在70%稀疏度下,FG-Attn比最先进的FlashInfer快2.45倍,平均减少注意力内核时间14.7%。FG-Attn将端到端视频生成时间比Flash Attention 3加速高达1.40倍(平均1.18倍)。

English abstract

Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps offers a promising opportunity to reduce this cost. In this work, we show that attention maps in diffusion transformers exhibit significant fine-grained sparsity in video generation models. Existing sparse attention methods, however, are too coarse-grained, leaving a large fraction of redundant computation unaddressed, or incur high overheads at finer granularity. We propose FG-Attn, a novel, low-overhead fine-grained sparse attention mechanism that skips score computations at the granularity of an MxN tile, where N>=1 and M>=16, and where each tile is the result of query-key dot products between M queries and N keys. FG-Attn addresses the key challenge of hardware underutilization in sparse attention kernels on GPUs, without incurring the overheads of irregular memory access and redundant operations. FG-Attn can fully supersede existing sparse attention methods and extend block sparse attention methods to finer granularities on modern GPUs. At 70% sparsity, FG-Attn is up to 2.45X faster than the state-of-the-art FlashInfer, and reduces attention kernel time by 14.7% on average. FG-Attn speeds up end-to-end video generation times by up to 1.40X (1.18X on average) over Flash Attention 3.

2508.13446 2026-06-10 cs.RO Version update

CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

CAST: 反事实标签提升视觉-语言-动作模型中的指令跟随能力

Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine

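Affiliations * University of California Berkeley; Princeton University

AI summary Addresses the difficulty VLA models have in following fine-grained instructions by using vision-language models to generate counterfactual labels that augment existing datasets and diversify language grounding; experiments show significantly higher instruction-following success rates in navigation and manipulation tasks.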

Details
AI-generated Chinese abstract

通用机器人应能理解并遵循用户指令。尽管当前视觉-语言-动作(VLA)模型为将开放词汇语言指令映射到机器人动作提供了强大架构,但它们难以遵循细粒度命令。原因之一是现有机器人数据集缺乏语义多样性和语言基础,特别是对于相似观测缺乏细粒度任务多样性。为解决此问题,我们提出一种新方法,利用视觉语言模型创建反事实标签来增强现有机器人数据集。通过用这些标签增强现有数据集,我们增加了机器人数据集语言基础的多样性和粒度,最终提升了VLA的语言跟随能力。我们通过在3个不同室内外环境中进行视觉语言导航实验,评估了所得模型遵循语言指令的能力,范围从简单的以物体为中心的指令到复杂的指代任务。实验表明,反事实重标记(无需额外数据收集)显著提升了VLA策略的指令跟随能力,超越了最先进方法,并且与在未增强数据上训练的VLA相比,成功率翻倍。我们还评估了该方法在操作VLA上的表现,发现在有干扰物的任务中性能有类似提升。

English abstract

Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.

2508.13362 2026-06-10 cs.LG Version update

Optimization-based Online Conformal Prediction for Multi-step Forecasting

基于优化的在线共形预测用于多步预测

Ruipu Li, Daniel Menacho, Alexander Rodríguez

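Affiliations * University of Michigan

AI summary Proposes O2CP, a framework that models multi-step error dependencies through a two-layer optimization structure and produces narrower prediction intervals while preserving marginal coverage validity; experiments show it outperforms existing methods in domains such as autonomous driving and climate forecasting.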

Details
AI-generated Chinese abstract

共形预测(CP)因其无分布覆盖保证而非常适合时间序列预测中的不确定性量化。然而,现有的多步方法往往难以平衡覆盖有效性与效率:它们要么独立校准每个预测步长,忽略时间相关性,要么强制执行严格的同步覆盖,导致区间过于保守。在这项工作中,我们提出了O2CP:基于优化的在线共形预测,这是一个统一的在线共形预测框架,显式建模多步误差依赖关系,同时不牺牲长期边际覆盖保证。我们首先证明,只要校准参数保持在定义的“安全”区域内,标准的在线共形更新就能保持有效性。利用这一理论见解,我们引入了一个双层架构:外层定义可接受的参数集以确保有效性,内层执行约束优化以建模联合误差分布并最小化整个预测步长的目标函数。为了使其计算可行,我们开发了一种轻量级采样策略,无需大型校准集即可估计联合分布。在包括自动驾驶、气候预测和公共卫生在内的真实世界数据集上的大量实验表明,O2CP始终优于最先进的基线方法,在实现目标覆盖率的同时,预测区间显著更窄,且长期遗憾更小。

English abstract

Conformal prediction (CP) is well-suited for uncertainty quantification in time series forecasting due to its distribution-free coverage guarantees. However, existing multi-step methods often struggle to balance coverage validity with efficiency: they either calibrate horizons independently, ignoring temporal correlations, or enforce strict simultaneous coverage, resulting in overly conservative intervals. In this work, we propose O2CP: Optimization-based Online Conformal Prediction, a unified framework for online conformal prediction that explicitly models multi-step error dependencies without sacrificing long-term marginal coverage guarantees. We first prove that standard online conformal updates maintain validity as long as calibration parameters remain within a defined "safe" region. Leveraging this theoretical insight, we introduce a two-layer architecture: an outer layer that defines admissible parameter sets to ensure validity, and an inner layer that performs constrained optimization to model joint error distributions and minimize horizon-wide objectives. To make this computationally feasible, we develop a lightweight sampling strategy that estimates joint distributions without requiring large calibration sets. Extensive experiments on real-world datasets, including autonomous driving, climate forecasting, and public health, demonstrate that O2CP consistently outperforms state-of-the-art baselines, achieving target coverage with significantly sharper prediction intervals and reduced regret over long horizons.
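The "standard online conformal updates" mentioned above can be illustrated with a single-horizon quantile-tracking rule. This is a generic sketch of that family of updates, not O2CP itself: the step size `eta`, the starting radius `q0`, and the function name are assumptions, and the paper's two-layer optimization and safe-region constraint are not modeled here.

```python
def online_conformal_radii(residuals, alpha=0.1, eta=0.05, q0=1.0):
    """Track an interval radius q_t so that long-run miscoverage approaches alpha.

    `residuals` are forecast errors (truth minus prediction) revealed one at a time.
    The interval used at step t is [prediction - q_t, prediction + q_t].
    """
    q = q0
    radii = []
    misses = []
    for r in residuals:
        radii.append(q)
        # A miss means the current interval failed to cover the truth.
        miss = 1.0 if abs(r) > q else 0.0
        misses.append(miss)
        # Widen after a miss, shrink slightly after a hit.
        q = q + eta * (miss - alpha)
    return radii, misses


# Deterministic toy residuals in [0, 0.99], used only to exercise the update.
residuals = [((i * 37) % 100) / 100 for i in range(20000)]
radii, misses = online_conformal_radii(residuals, alpha=0.1, eta=0.05, q0=1.0)
miss_rate = sum(misses) / len(misses)
```

Because the radius stays bounded whenever the residuals are bounded, the average miss rate is forced toward `alpha` as the number of steps grows. Running this rule independently per horizon is the kind of baseline the abstract criticizes for ignoring temporal correlations between forecast steps.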

2504.02323 2026-06-10 cs.CL Version update

CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring and Feedback

CoTAL:面向可泛化形成性评估评分与反馈的人机协同提示工程

Clayton Cohn, Ashwin T S, Naveeduddin Mohammed, Gautam Biswas

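Affiliations * Vanderbilt University

AI summary Proposes CoTAL, which combines Evidence-Centered Design, human-in-the-loop prompt engineering, and chain-of-thought prompting to iteratively refine LLM scoring, improving GPT-4's scoring performance by up to 38.9% across several domains and earning approval from teachers and students.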

Comments Submitted to Computers and Education: Artificial Intelligence. Currently under review

Details
AI-generated Chinese abstract

大型语言模型(LLM)为辅助教师和支持学生学习创造了新机遇。尽管研究者已在教育背景下探索了各种提示工程方法,但这些方法在科学、计算和工程等领域的泛化程度仍待深入研究。本文提出思维链提示+主动学习(CoTAL),一种基于LLM的形成性评估评分方法,该方法(1)利用证据中心设计(ECD)将评估和评分标准与课程目标对齐,(2)应用人机协同提示工程自动化响应评分,(3)结合思维链(CoT)提示以及教师和学生反馈,迭代优化问题、评分标准和LLM提示。我们的研究结果表明,CoTAL提升了GPT-4在多个领域的评分性能,相比无提示工程基线(即无标注示例、思维链提示或迭代优化),增益高达38.9%。教师和学生认为CoTAL在评分和解释响应方面有效,他们的反馈产生了有价值的见解,提高了评分准确性和解释质量。

English abstract

Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.

2508.07048 2026-06-10 cs.SD cs.AI cs.LG eess.AS Version update

Whisfusion: Parallel ASR Decoding with Masked Diffusion

Whisfusion: 基于掩码扩散的并行ASR解码

Taeyoun Kwon, Junhyuk Ahn, Taegeun Yun, Heeju Jwa, Yoonchae Choi, Siwon Park, Jongchan Kim, Hyungon Ryu, Hyuk-Jae Lee, Nam-Joon Kim

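Affiliations * Seoul National University; Soongsil University; NVIDIA Corporation

AI summary Proposes Whisfusion, which trains a dedicated masked diffusion decoder on frozen Whisper audio embeddings and performs non-autoregressive ASR through parallel diffusion decoding, surpassing Whisper-large-v3 on multilingual benchmarks while running 4-5x faster.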

Comments 16 pages, 3 figures

Details
AI-generated Chinese abstract

自回归(AR)编码器-解码器模型主导着高质量的多语言ASR,但其从左到右的解码器使得推理延迟随转录长度增加。一种自然的替代方案,CTC风格的非自回归(NAR)系统避免了这一瓶颈,但其条件独立性假设牺牲了转录级别的生成建模。掩码扩散语言模型(例如LLaDA、MDLM)提供了一种有竞争力的NAR文本生成方法。我们探究这类模型是否能在消除从左到右瓶颈的同时,将NAR ASR带入强AR ASR系统的准确率范围。我们提出Whisfusion,它在冻结的Whisper-large-v3音频嵌入之上从头训练一个专用的掩码扩散解码器,仅需几步即可去噪掩码转录。我们在约68k小时的11种语言语音上训练,采用高掩码专门化以将训练与推理的完全掩码起始点对齐,并通过并行扩散解码进行解码。Whisfusion在英语、欧洲和CJK基准测试的组平均准确率上超越Whisper-large-v3,同时运行速度快4-5倍,在准确率和吞吐量上均超越Whisper-turbo。它达到与Canary和Qwen3-ASR竞争的准确率,同时运行速度快3-7倍。这些结果确立了掩码扩散作为高吞吐量多语言转录的帕累托竞争性非自回归范式。代码和模型权重可在 https://github.com/taeyoun811/Whisfusion 获取。

English abstract

Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. A natural alternative, CTC-style non-autoregressive (NAR) systems, avoids this bottleneck, but the conditional independence assumption of these systems sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR text-generation approach. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings, denoising masked transcripts in just a few steps. We train on ~68k hours of 11-language speech with high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks, while running 4-5x faster, additionally surpassing Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.

2503.24007 2026-06-10 cs.LG cs.AI Version update

CITRAS: Covariate-Informed Transformer for Time Series Forecasting

CITRAS: 协变量感知的Transformer时间序列预测

Yosuke Yamaguchi, Issei Suemitsu, Wenpeng Wei

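Affiliations * Research & Development Group, Hitachi, Ltd.

AI summary Proposes CITRAS, a decoder-only Transformer that uses KV shift and attention score smoothing to flexibly incorporate the future portion of known covariates and capture both local and global cross-variate dependencies, improving forecasting accuracy.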

Details
Journal ref
IEEE Access, vol. 14, pp. 77983-77998, 2026
AI-generated Chinese abstract

在时间序列预测中,协变量代表影响目标变量的外部因素。一些协变量仅在过去可观测(观测协变量,如记录的天气数据),而另一些则预先已知(已知协变量,如日历事件或折扣计划)。尽管协变量有潜力提升预测性能,但大多数基于深度学习的预测模型难以处理由已知协变量的未来部分引起的变量长度差异,且无法灵活利用它们。此外,捕获目标变量与协变量之间的依赖关系并非易事,因为模型必须准确反映协变量的局部影响,同时建模全局跨变量依赖。为应对这些挑战,我们提出CITRAS,一种仅解码器Transformer,灵活整合多个目标变量、观测协变量和已知协变量。在保持强大自回归建模能力的同时,CITRAS在分块跨变量注意力中引入两种新机制:键值移位和注意力分数平滑。键值移位通过基于并发依赖将已知协变量的未来部分与目标变量对齐,无缝融入预测过程。注意力分数平滑通过平滑历史注意力分数,将局部精确的分块跨变量依赖细化为全局变量级依赖。实验上,CITRAS在协变量感知和多变量设置下的广泛真实世界数据集上展现出强大性能,展示了其利用跨变量和跨时间依赖提升预测准确性的通用能力。

English abstract

In time series forecasting, covariates represent external factors that influence target variables. Some covariates are observable only in the past (observed covariates, such as recorded weather data), while others are known in advance (known covariates, such as calendar events or discount schedules). Although covariates have the potential to enhance forecasting performance, most deep learning-based forecasting models struggle to address the length discrepancy between variables caused by the future portion of known covariates and fail to leverage them flexibly. Moreover, capturing dependencies between target variables and covariates is non-trivial, as models must accurately reflect the local impact of covariates while simultaneously modeling global cross-variate dependencies. To address these challenges, we propose CITRAS, a decoder-only Transformer that flexibly integrates multiple target variables, observed covariates, and known covariates. While preserving strong autoregressive modeling capabilities, CITRAS introduces two novel mechanisms within patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates the future portion of known covariates into the forecasting process by aligning them with target variables based on their concurrent dependencies. Attention Score Smoothing refines locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the historical attention scores. Experimentally, CITRAS demonstrates strong performance across a wide range of real-world datasets in both covariate-informed and multivariate settings, showcasing its versatile ability to leverage cross-variate and cross-time dependencies for improved forecasting accuracy.

2507.15294 2026-06-10 cs.SD cs.MM Version update

MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions

MeMo: 视觉受损条件下的实时视听目标说话人提取的注意力动量

Junjie Li, Wenxuan Wu, Shuai Wang, Zexu Pan, Kong Aik Lee, Helen Meng, Haizhou Li

Affiliations * Department of Electrical and Electronic Engineering, Faculty of Engineering, The Hong Kong Polytechnic University; Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; School of Artificial Intelligence (SAI), The Chinese University of Hong Kong, Shenzhen; School of Intelligence Science and Technology, Nanjing University; Tongyi Lab, Alibaba Group, Singapore

AI summary Proposes MeMo, a framework that stores attention information in two adaptive memory banks to maintain attentional momentum when visual cues are missing, enabling real-time target speaker extraction with an SI-SNR improvement of at least 2 dB.

Details
AI-generated Chinese abstract

视听目标说话人提取(AV-TSE)旨在通过利用视觉线索作为指导,从多说话人环境中分离出目标说话人的声音。然而,AV-TSE系统的性能严重依赖于这些视觉线索的质量。在视觉线索缺失或严重退化的极端场景中,系统可能无法准确提取目标说话人。相比之下,人类即使在缺乏明确辅助信息的情况下也能保持对目标说话人的注意力。受这种人类认知能力的启发,我们提出了一种名为MeMo的新框架,该框架包含两个自适应记忆库来存储注意力相关信息。MeMo专为实时场景设计:一旦建立初始注意力,系统就会随时间维持注意力动量,即使视觉线索变得不可用。我们进行了全面的实验来验证MeMo的有效性。实验结果表明,我们提出的框架相比相应基线实现了至少2 dB的SI-SNR提升。

English abstract

Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate a target speaker's voice from multi-speaker environments by leveraging visual cues as guidance. However, the performance of AV-TSE systems heavily relies on the quality of these visual cues. In extreme scenarios where visual cues are missing or severely degraded, the system may fail to accurately extract the target speaker. In contrast, humans can maintain attention on a target speaker even in the absence of explicit auxiliary information. Motivated by such human cognitive ability, we propose a novel framework called MeMo, which incorporates two adaptive memory banks to store attention-related information. MeMo is specifically designed for real-time scenarios: once initial attention is established, the system maintains attentional momentum over time, even when visual cues become unavailable. We conduct comprehensive experiments to verify the effectiveness of MeMo. Experimental results demonstrate that our proposed framework achieves SI-SNR improvements of at least 2 dB over the corresponding baseline.
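The SI-SNR figure quoted above is the scale-invariant signal-to-noise ratio, a standard measure of how closely an extracted waveform matches the reference speech. A minimal sketch of the usual definition in plain Python follows; the function name and the small `eps` stabilizer are assumptions, and this is not code from the MeMo authors.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to isolate the target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]
    # Whatever is left after removing the projection is treated as noise.
    noise = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise) + eps
    return 10 * math.log10(signal_energy / noise_energy + eps)


target = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 1.2, -1.0]
score = si_snr(estimate, target)
louder_score = si_snr([2 * x for x in estimate], target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged. On this logarithmic scale, a 2 dB gain corresponds to roughly a 1.6x higher signal-to-noise energy ratio.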