arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.22257 2026-05-22 cs.LG cs.AI cs.LO

What are the Right Symmetries for Formal Theorem Proving?

Krzysztof Olejniczak, Radoslav Dimitrov, Xingyue Huang, Bernardo Cuenca Grau, Jinwoo Kim, İsmail İlkan Ceylan

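Affiliations: University of Oxford; KAIST; TU Wien; AITHYRA

AI summary: This paper asks which symmetries formal theorem proving should respect. It proposes rewriting categories, a category-theoretic framework for formalizing proof equivariance and success invariance, and uses test-time methods to improve the robustness and performance of LLM-based theorem provers.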

Abstract

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.
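
The aggregation idea can be illustrated with a small numerical sketch. It is not the paper's method: it assumes a hypothetical prover whose per-sample success probability on each rewriting is known, that every equivalent statement expands to the same set of rewritings, and that samples are independent. The name `aggregated_success` and all probabilities are made up for illustration.

```python
from math import prod

def aggregated_success(per_rewrite_success, samples_per_rewrite):
    """Probability that at least one sample succeeds when every rewriting
    receives the same number of independent samples (assumed setup)."""
    return 1.0 - prod((1.0 - p) ** samples_per_rewrite for p in per_rewrite_success)

# Hypothetical per-sample success rates on three equivalent formulations.
rates = [0.02, 0.30, 0.10]

# Spending a 12-sample budget on one formulation depends on which one is given.
single = [1.0 - (1.0 - p) ** 12 for p in rates]

# Spreading the same budget over all rewritings gives one value for the whole set,
# whichever formulation was supplied as input.
pooled = aggregated_success(rates, 4)
pooled_other_input = aggregated_success(list(reversed(rates)), 4)
```

Under these assumptions the pooled value no longer depends on the input formulation, which is the invariance property in miniature. It does not by itself exceed the best single formulation.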

2605.22249 2026-05-22 cs.CV

D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities

Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan

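Affiliations: The University of Western Australia; The University of Melbourne

AI summary: This paper proposes D3Seg, which combines multi-hop modality graph fusion, a lightweight diffusion-based imputation mechanism, and probability-space decision refinement to segment brain tumors when MRI modalities are missing, improving segmentation performance while maintaining computational efficiency.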

Abstract

Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model, D3Seg, designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher-order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant-class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on the BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing-modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing-modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.
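
For readers unfamiliar with the Dice metric quoted above, a minimal sketch follows. It uses the standard definition, Dice = 2|P ∩ T| / (|P| + |T|), on flat binary masks; the function name, the toy masks, and the empty-mask convention are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy masks: three predicted voxels, three true voxels, two in common.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 0, 1, 1, 1, 0]
score = dice_score(pred_mask, true_mask)
```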

2605.22248 2026-05-22 cs.LG

No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation

Bradley Stanley-Clamp, Anson Lei, Hannah M. Christensen, Ingmar Posner

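Affiliations: Applied AI Lab, University of Oxford, UK; Atmospheric, Oceanic and Planetary Physics, University of Oxford, UK

AI summary: This paper examines why out-of-distribution generalisation matters for climate emulation, proposes a new evaluation framework that uses seasonal variation to test emulator robustness, and shows that physically motivated decompositions improve out-of-distribution performance without substantially sacrificing in-distribution performance.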

Comments 36 pages, 12 figures

Abstract

Climate emulation is an out-of-distribution (OOD) projection task. This is precisely the challenge where modern Machine Learning (ML) methods are most prone to failure. Consequently, while current ML emulators trained on present climate achieve high in-distribution performance, their future reliability under the inevitable distribution shifts of a changing climate remains a critical, poorly understood blind spot. Addressing this challenge requires a fundamental shift in how we understand, evaluate, and design climate emulators. In this work, we first confirm that climate change drives a statistically significant and progressively growing shift in atmospheric state distributions, rendering standard evaluation protocols insufficient. We empirically establish that seasonal variation serves as an effective proxy for these long-term climate shifts, providing access to real-world distribution shifts without recourse to heuristics like synthetic perturbations. Motivated by this link, we introduce a novel evaluation framework that leverages seasonal shifts as a rigorous, zero-overhead testbed for emulator robustness. Our systematic characterisation confirms that current state-of-the-art hybrid-ML emulators degrade significantly under these realistic shifts. Finally, we chart a path forward by identifying compositional generalisation, the ability to form novel combinations from observed elementary components, as a principled route towards robust climate emulation. We demonstrate that physically motivated decompositions substantially improve OOD performance with only modest trade-offs against in-distribution performance, providing an avenue towards ML-driven climate emulators robust to an unknown future.

2605.22247 2026-05-22 cs.CL

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei, Kfir Bar

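Affiliations: Data Science Institute, Reichman University; Efi Arazi School of Computer Science, Reichman University; Vrije Universiteit Amsterdam

AI summary: This paper introduces IdioLink, a retrieval benchmark that tests whether models can link idiomatic expressions to conceptually equivalent meanings in literal or paraphrased form, and it exposes the shortcomings of current models in idiom-aware semantic retrieval.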

Abstract

Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.

2605.22243 2026-05-22 cs.LG cs.AI stat.AP

Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Junyu Yan, Damian Machlanski, Kurt Butler, Panagiotis Dimitrakopoulos, Ewen M Harrison, Bruce Guthrie, Sotirios A Tsaftaris

Affiliations: School of Engineering, University of Edinburgh; Causality in Healthcare AI Hub (CHAI); Advanced Care Research Centre, Usher School of Population Health Sciences, University of Edinburgh; Centre for Medical Informatics, Usher School of Population Health Sciences, University of Edinburgh

AI summary: This paper proposes an AI recommender that uses a data-driven approach to improve the predictive performance of existing interpretable statistical models. Its main contribution is to use explainable AI techniques to produce three types of recommendation that improve both predictive ability and transparency.

Comments 41 pages, 7 figures

Abstract

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve the process of developing, and the performance of high-dimensional transparent predictive models.
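
The C-index reported above measures how well predicted risks rank subjects by time to event. A minimal sketch of a Harrell-style concordance index follows; it is a plain-Python illustration with toy data, not the paper's evaluation code, and the function name and values are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index: the share of comparable pairs in which the
    subject with the earlier observed event has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: follow-up times, event indicators (0 = censored), predicted risks.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a change from 0.805 to 0.815 is a modest but measurable gain.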

2605.22238 2026-05-22 cs.AI

Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

H. C. Ekne

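Affiliations: Gemini; OpenAI; Kimi

AI summary: This paper studies how large language models perform in a live strategy environment. It finds that performance depends on objective tracking, execution conversion, cost, and runtime reliability, and it supports evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.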

Comments 13 pages, 7 figures. Code and tracked notes: https://github.com/hcekne/risk-game . Public runtime artifact index: https://github.com/hcekne/risk-game/blob/main/docs/article-plans/public_experiment_artifacts.md

Abstract

Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null ($p \approx 1.5 \times 10^{-5}$). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality ($p \approx 0.821$), which indicates that much of the earlier provider spread came from end-to-end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.
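
To give a feel for what an equal-strength null means here, the sketch below computes a one-sided binomial tail for a single player's win count. It assumes four players per game, so each wins with probability 1/4 under the null. This is a simpler test than the pooled winner-distribution test the abstract reports, so the value it produces need not match the quoted p-value; the function name is an assumption.

```python
from math import comb

def binomial_upper_tail(wins, games, p_win):
    """P(X >= wins) for X ~ Binomial(games, p_win)."""
    return sum(
        comb(games, k) * p_win**k * (1.0 - p_win) ** (games - k)
        for k in range(wins, games + 1)
    )

# 20 wins in 32 games; the assumed equal-strength null gives each player p = 1/4.
tail = binomial_upper_tail(20, 32, 0.25)
```

Under the null the expected win count is 8, so 20 wins sits far in the upper tail, which is consistent with the abstract's conclusion that the providers were not equally strong end to end.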

2605.22231 2026-05-22 cs.CV

REACH: Hand Pose Estimation from Room Corners

Shu Nakamura, Ryo Kawahara, Genki Kinoshita, Ryosuke Hirai, Yasutomo Kawanishi, Shohei Nobuhara, Ko Nishino

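Affiliations: Graduate School of Informatics, Kyoto University; RIKEN; Kyoto Institute of Technology

AI summary: This paper proposes a new 3D hand pose estimator that accurately recovers the shape and pose of people's hands from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. The core approach is to fully exploit hand-body coordination, its temporal progression, and multiview observations through a new Transformer-based model that models hand and body configurations via correlations between per-view tokens and exploits temporal coordination autoregressively. It also introduces REACH, a new dataset for training and testing the method, described as a first-of-its-kind large-scale hand pose dataset capturing accurate hand movements of 50 participants across a variety of daily activities. Extensive experiments, including comparisons with existing methods, show that the model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar.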

Abstract

We introduce a novel 3D hand pose estimator that can accurately recover the shape and pose of people's hands in a room from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. Our key idea is to fully leverage hand-body coordination, its temporal progression, and multiview observations. We achieve this with a novel Transformer-based model, in which hand and body configurations are modeled through correlations between their visual features expressed as per-view tokens, and their temporal coordination is exploited in an autoregressive manner. We introduce a novel dataset, which we refer to as REACH, Room-Environment dataset Annotated with Chest cameras for Hand pose estimation, to train and test our method. REACH is a first-of-its-kind large-scale hand pose dataset that captures accurate hand movements of 50 participants across a wide variety of daily activities. In order to avoid interfering with natural movements while annotating the hands with accurate shape and pose, we leverage concealed chest cameras. Through extensive experiments, including comparative studies with existing methods, we show that our model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially towards "in-the-wild" continuous human behavior analysis.

2605.22228 2026-05-22 cs.CL

GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang, Yukun Ma

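Affiliations: Qiqihar University

AI summary: This paper proposes GHI, a framework that builds an incidence-based structural reasoning layer on a bipartite topology so that the different structural signals in aspect-based sentiment analysis can be handled through a unified interface. Experiments show that GHI outperforms existing methods on several standard benchmarks and performs well with comparatively few parameters.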

Comments 15 pages, 8 figures, 7 tables

Abstract

Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural reasoning layer built on a bipartite topology. GHI represents diverse linguistic and semantic evidence as token-hyperedge incidence relations, allowing different structural signals to be incorporated through a unified interface. Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over a strong DeBERTa baseline. Further experiments show that with only 247M parameters, GHI approaches the performance of 11B Flan-T5-based methods on the ISE benchmark. Moreover, it demonstrates strong robustness on the challenging ARTS datasets, maintaining highly competitive performance where traditional models degrade. These results demonstrate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.

2605.22223 2026-05-22 cs.LG

How Many Different Outputs Can a Transformer Generate?

Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan

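Affiliations: Department of Mathematics, National University of Singapore, Singapore, 117543; School of Computing, National University of Singapore, Singapore, 117543; Aix Marseille Univ, CNRS, I2M, Marseille, France; Department of Electrical and Computer Engineering, National University of Singapore

AI summary: This paper studies how a handful of characteristics of a transformer's architecture can closely predict, both qualitatively and quantitatively, the number of different sequences it can output. It gives an upper bound that depends on prompt length and is shown empirically to be tight up to a factor of less than 10 across architectures and model sizes. The analysis also explains previously observed empirical failures of transformers on simple sequence tasks such as copying and cramming.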

Comments ICML 2026 Spotlight

Abstract

We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

2605.22221 2026-05-22 cs.LG cs.AI cs.LO

Can Transformers Learn to Verify During Backtracking Search?

Yin Jun Phua, Tony Ribeiro, Tuan Nguyen, Katsumi Inoue

Affiliations: Yin Jun Phua (corresponding author), Institute of Science Tokyo, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan; Tony Ribeiro, Centrale Nantes, CNRS, Laboratoire des Sciences du Numérique de Nantes, LS2N, UMR 6004, F-44000 Nantes, France; National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; Steelous Protocol, 8-20-32, Ginza, Chuo-ku, Tokyo 104-0061, Japan; Tuan Nguyen, Hanoi University of Science and Technology, No. 1 Dai Co Viet, Hai Ba Trung, Ha Noi, Vietnam; Katsumi Inoue, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

AI summary: This paper studies the verification ability of transformers during backtracking search. It shows that standard training on cumulative traces suffers from scattered retrieval and history entanglement, proposes localization and Selective State Attention (SSA) to address them, and validates SSA experimentally on 3-SAT, graph coloring, Blocks World, and backtracking parsing.

Abstract

Backtracking search underlies classical constraint solvers, planners, and theorem provers. Recent transformer-based reasoning systems explore search trees over their own intermediate steps. A common training recipe fits an autoregressive next-token loss on offline solver traces. The model's input at each step is a cumulative trace of all prior decisions. The optimal continue-or-backtrack predictor depends only on the current search state, since two trajectories reaching the same state admit the same viable continuations. We show that decoder-only transformers trained on cumulative traces fail this requirement in two ways: the trace can scatter state features across many positions (scattered retrieval), and the predictor can condition on the trajectory rather than the state (history entanglement). We address scattered retrieval with localization, a trace-level fix that rewrites each decision block to expose state features locally. We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally without modifying training data, objective, or parameters. We focus on reactive verification, after propagation has exposed a contradiction. We test SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not. Our contribution is a diagnostic of transformer behavior on serialized trajectory data, paired with a structural fix. Pretrained language models that search over their own reasoning steps may face the same failure. Our analysis opens up inference-time context clearing as a candidate way to apply the same isolation without retraining.
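
The abstract does not specify the SSA mask, so the following is only one hypothetical reading of a fixed attention mask that hides history: each token attends causally, but only to tokens in its own decision block. The function names and the block encoding are assumptions for illustration, not the authors' construction.

```python
def block_local_causal_mask(block_ids):
    """mask[q][k] is True when query position q may attend to key position k:
    k is not in the future and belongs to the same block as q."""
    n = len(block_ids)
    return [
        [k <= q and block_ids[k] == block_ids[q] for k in range(n)]
        for q in range(n)
    ]

def visible_offsets(mask, q):
    """Offsets (relative to q) of the positions that q can see."""
    return [k - q for k, allowed in enumerate(mask[q]) if allowed]

# Two serialized traces whose final block has the same length but which
# differ in how many earlier decision blocks precede it (hypothetical data).
short_history = [0, 0, 1, 1, 1]
long_history = [0, 0, 1, 1, 2, 2, 3, 3, 3]

mask_short = block_local_causal_mask(short_history)
mask_long = block_local_causal_mask(long_history)
```

With such a mask the last position sees the same relative window in both traces, so a decision made there cannot depend on how the state was reached. This mirrors the same-state, different-history comparison described in the abstract, under the stated assumptions.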

2605.22219 2026-05-22 cs.AI

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie, Yun Ma

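Affiliations: Peking University; Beijing University of Technology

AI summary: This paper presents SGR-Bench, a benchmark for evaluating state-gated retrieval that contains 100 expert-curated tasks. By comparing explicit and implicit guidance, it reveals the main challenges search agents face on state-gated retrieval tasks.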

Comments Work in Progress. 23 pages, 7 figures, preprint

Abstract

Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit guidance for state-gated retrieval. We evaluate eight CLI-based agentic LLM systems and three commercial search-agent products. On SGR-Bench, the strongest system reaches only 66.18% item-level F1, while row-level F1 remains much lower. A manual audit of 156 analyzable failed CLI trajectories shows why: agents often reach a relevant web source, but establish the wrong site-specific retrieval state. Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate, whereas final answer composition accounts for only 10.3%. The dataset and single-case evaluation instructions are available at https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH.
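
The benchmark's exact scoring rules are not given in the abstract. As a point of reference, the sketch below shows the standard set-based F1 that an item-level score of this kind is usually built on; the function name, the empty-answer convention, and the toy items are assumptions.

```python
def set_f1(predicted, gold):
    """F1 between a predicted and a gold collection, compared as sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # assumed convention for two empty answers
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2.0 * precision * recall / (precision + recall)

# Toy structured answer: two of four gold items found, plus one spurious item.
gold_items = {"A", "B", "C", "D"}
predicted_items = {"A", "B", "X"}
f1 = set_f1(predicted_items, gold_items)
```

Applying the same function to whole rows instead of individual items is stricter, since a row counts only if every field matches, which is one plausible reason a row-level score would be lower than an item-level one.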

2605.22217 2026-05-22 cs.LG cs.CL

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang

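Affiliations: University of California, Santa Barbara; Cisco Research

AI summary: This paper studies the asymmetric roles of data gating and reward grounding in self-play reinforcement learning. It finds that data gating is the key factor in maintaining stability, whereas no reward signal guarantees stability on its own once the gate is removed, and it identifies the "Grounded Proposer Paradox".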

Abstract

Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth; while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.

2605.22213 2026-05-22 cs.AI

Towards a compositional semantics for quantitative confidence assessment in assurance arguments

Benjamin Herd, Jessica Kelly, Jan Sabsch, Lydia Gauerhof

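Affiliations: Luxoft GmbH; Robert Bosch GmbH

AI summary: This paper proposes a compositional semantics for quantitative confidence assessment in assurance arguments. It represents argument elements as Subjective Logic opinions and maps the relations between elements to Subjective Logic operators, which allows confidence to propagate through the argument.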

Comments Accepted to the 21st European Dependable Computing Conference (EDCC 2026), Canterbury, UK

Journal ref Proceedings of the 21st European Dependable Computing Conference (EDCC 2026)

Abstract

Assurance arguments provide a clear and structured way to explain why stakeholders should trust that a system satisfies certain properties, yet widely used notations, e.g., Goal Structuring Notation (GSN), typically lack an operational semantics for deriving assurance confidence. Existing approaches address structure and soundness but largely reason over truth values, not over confidence in the justification of claims. Subjective Logic (SL) offers a calculus of belief, disbelief, and uncertainty with operators for combining opinions, enabling confidence propagation under incomplete, conflicting, or subjective evidence. However, existing SL-based approaches do not provide a uniform, compositional semantics that covers all argument elements and relations to enable overall confidence assessment. We propose a confidence semantics that represents argument elements as SL opinions and maps relations between elements to SL operators modelling how confidence flows, effectively turning the argument into an analyzable confidence network. The approach provides explicit warrants, principled handling of context, preserved provenance, and compatibility with GSN, along with practical guidance using an exemplary assurance confidence assessment.
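
The abstract does not say which SL operator is assigned to which argument relation, so the sketch below shows only generic building blocks from Subjective Logic: a binomial opinion with belief, disbelief, and uncertainty summing to one, its projected probability, and cumulative fusion of two independent sources. The class and function names and the two evidence items are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self):
        """Belief plus the base-rate share of the uncertainty mass."""
        return self.belief + self.base_rate * self.uncertainty

def cumulative_fusion(a, b):
    """Cumulative fusion of two opinions from independent sources.
    Assumes equal base rates and at least one non-zero uncertainty."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        raise ValueError("both opinions are dogmatic (zero uncertainty)")
    return Opinion(
        belief=(a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        disbelief=(a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        uncertainty=(a.uncertainty * b.uncertainty) / k,
        base_rate=a.base_rate,
    )

# Two hypothetical pieces of evidence supporting the same claim.
test_evidence = Opinion(0.6, 0.1, 0.3)
review_evidence = Opinion(0.5, 0.2, 0.3)
fused = cumulative_fusion(test_evidence, review_evidence)
```

Fusing the two opinions lowers the uncertainty mass below that of either source, which is the kind of confidence flow a network of such operators would propagate up an argument.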

2605.22211 2026-05-22 cs.AI

CLORE: Content-Level Optimization for Reasoning Efficiency

Yuyang Wu, Qiyao Xue, Guanxing Lu, Weichen Liu, Zihan Wang, Manling Li, Olexandr Isayev

Affiliations * Carnegie Mellon University; Northwestern University; University of North Carolina, Chapel Hill; University of Pittsburgh

AI summary This paper proposes CLORE, a framework that improves the reasoning efficiency of large language models by editing correct on-policy rollouts. An external augmentation model deletes redundant, illegible, or irrelevant content while preserving the final answer, and the resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. Experiments on five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and is compatible with GRPO, DAPO, Training Efficient, and ThinkPrune.

Comments 9 pages, 9 figures

Abstract

Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.
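
To make the "reference-free DPO objective" concrete, here is a minimal Python sketch of a pairwise preference loss that uses only the policy's own log-probabilities, with no reference model. The abstract does not give the exact form used in CLORE, so the function name, the `beta` scale, and the choice to treat the edited rollout as preferred and the original as rejected are all assumptions made for illustration.

```python
import math


def reference_free_preference_loss(logp_preferred, logp_rejected, beta=1.0):
    """Return -log(sigmoid(beta * (logp_preferred - logp_rejected))).

    logp_preferred: policy log-probability of the edited (shorter) rollout.
    logp_rejected:  policy log-probability of the original rollout.
    """
    margin = beta * (logp_preferred - logp_rejected)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively more probability to the edited rollout, which is the direction a content-level deletion signal would push.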

2605.22209 2026-05-22 cs.CV

GALAR-TemporalNet v2: Anatomy-Guided Dual-Branch Temporal Classification with Bidirectional Mamba and Dual-Graph GCN for Video Capsule Endoscopy -- after competition results

Jiye Won, Seangmin Lee, Soon Ki Jung

Affiliations * School of Computer Science and Engineering, Kyungpook National University

AI summary This work addresses multi-label temporal classification in video capsule endoscopy, which requires simultaneously localizing 8 anatomical regions and detecting 9 pathological findings. It proposes GALAR-TemporalNet v2, which combines windowed self-attention, a Dual-Graph GCN, and Bidirectional Mamba to tackle class imbalance, long-range temporal dependencies, and pathology-anatomy entanglement, and it reaches higher mAP on the RARE-VISION test set than the competition version.

Comments 7 pages, 2 figures. Post-competition preprint for the ICPR 2026 RARE-VISION Challenge

Abstract

Video Capsule Endoscopy (VCE) poses a challenging multi-label temporal classification problem, requiring simultaneous localization of 8 anatomical regions and detection of 9 pathological findings across tens of thousands of frames. We present GALAR-TemporalNet v2, a hierarchical temporal model that addresses three core challenges: extreme class imbalance, long-range temporal dependencies, and pathology-anatomy entanglement. Our architecture combines windowed self-attention for local modeling, a Dual-Graph GCN for global frame relationships, and Bidirectional Mamba for selective boundary context encoding. A novel anatomy prototype residual pathway decouples pathological deviation signals from normal organ appearance, and a frame-level GCN skip connection stabilizes training of visually confusable rare classes. The competition version, GALAR-TemporalNet, achieved an overall mAP@0.5 of 0.2644 and mAP@0.95 of 0.2353 on the RARE-VISION test set. Following the competition, the redesigned GALAR-TemporalNet v2, which incorporates a restructured pathology branch, refined loss functions, and extended post-processing, improved these results to mAP@0.5 of 0.3409 and mAP@0.95 of 0.3333.

2605.22205 2026-05-22 cs.AI cs.LG

Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

Zhuo Li, Guodong Du, Zesheng Shi, Weiyang Guo, Weijun Yao, Yuan Zhou, Jiabo Zhang, Jing Li

Affiliations * Harbin Institute of Technology, Shenzhen, China; The Hong Kong Polytechnic University; Huawei Technologies Co., Ltd.; Shanghai Jiaotong University

AI summary This study proposes SkillWeave, a framework that uses modular skillpacks to let LLMs specialize across domains under a fixed memory budget, with SkillZip compression for efficient deployment. Experiments on multi-task and agentic benchmarks show strong performance and up to 4x speedup.

Comments Accepted by ACL2026

Abstract

Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWeave, a modular improvement framework that enables LLMs to specialize under fixed memory budgets. SkillWeave partitions the full capabilities of a general-purpose model into skillpacks (lightweight, domain-specific delta modules) that reorganize and refine the model's internal knowledge. For efficient deployment, SkillWeave integrates SkillZip to compress skillpacks into a compact, inference-ready format, enabling strong multi-domain performance with low-latency execution. On multi-task and agentic benchmarks, a 9B SkillWeave model outperforms several baselines and even surpasses a 32B monolithic LLM, while achieving up to 4x speedup.

2605.22204 2026-05-22 cs.CL

Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus

Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim

Affiliations * Northwestern University in Qatar; Hamad bin Khalifa University

AI summary This paper presents the Arabic Women and Society Corpus, which contains 252,487 public Arabic Facebook posts from 2013 to 2024 on women's empowerment and social wellbeing. Processed with an automated pipeline, the corpus supports large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects.

Abstract

This paper presents the Arabic Women and Society Corpus, a ten-year collection of 252,487 public Arabic Facebook posts related to women's empowerment and social wellbeing. The corpus was collected from 51,660 pages across 77 countries between 2013 and 2024, resulting in more than 267 million user interactions. Each post includes engagement metrics such as shares, comments, and emotional reactions, providing a unique view of audience sentiment and social attention. The data were processed using an automated pipeline with language identification, normalization, and metadata cleaning to ensure reliability and reproducibility. The corpus enables large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects. It supports research in Arabic natural language processing, computational social science, and digital communication studies. The dataset and accompanying documentation will be released upon request for research use.

2605.22203 2026-05-22 cs.CL

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn

Affiliations * Department of Big Data, Chungbuk National University, Cheongju-si, South Korea; Department of Computer Science, Chungbuk National University, Cheongju-si, South Korea; BigDataLabs Co., Ltd.; Department of Management Information Systems, Chungbuk National University, South Korea

AI summary This study compares four text chunking methods on Khmer agricultural documents, evaluating within a Retrieval-Augmented Generation (RAG) framework how the chunking strategy affects dense retrieval, and finds that character-based recursive chunking performs best for this low-resource language.

Comments 11 pages, 1 figure

Abstract

In this study, we compare the performance of four text chunking approaches (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
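
For readers unfamiliar with character-based recursive chunking, the following Python sketch shows the general technique with the 300-character limit mentioned above. It is not the authors' implementation: the function name `recursive_chunks` and the separator order (paragraph break, line break, the Khmer sentence terminator "។", then space) are assumptions chosen for illustration.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", "។", " ")):
    """Split text into chunks of at most chunk_size characters.

    The coarsest separator is tried first; finer separators (and finally a
    hard character cut) are used only for pieces that are still too long.
    Separators at chunk boundaries are dropped.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Because the splitter prefers coarse boundaries, chunks tend to end at paragraph or sentence breaks when those fit within the limit, which is the structural preservation the abstract points to.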

2605.22202 2026-05-22 cs.CL

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

Amanda Myntti, Jenna Kanerva, Veronika Laippala, Filip Ginter

Affiliations * TurkuNLP, University of Turku, Finland; ELLIS Institute Finland

AI summary This paper studies how high-performing embedding models organize their embedding spaces in a consistent way. Evaluating 25 contemporary embedding models on five MTEB tasks, it finds that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances correlate strongly with task performance, and that embedding tasks differ in their degree of linearity and their reliance on retention of local information.

Abstract

In this paper, we show that high-performing embedding models organize their embedding spaces in a consistent way. We evaluate 25 contemporary embedding models on five MTEB tasks spanning four diverse task categories (retrieval, bitext mining, pair classification, and summarization) in both English and multilingual settings, and reveal that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances strongly correlate (even up to 0.97) with performance on the given task. Ultimately, we show that embedding tasks display varying degrees of linearity and reliance on retention of local information. Our results further the understanding of embeddings, their relation to model performance, and shed light on possible future training objectives and optimizing conditional embeddings.
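
One plausible reading of "nearest-neighbor overlap between paired text instances" can be sketched in Python as follows. The abstract does not define the measure, so the function names, the use of cosine similarity, and the averaging over pairs are assumptions: for each pair i, the sketch compares the k nearest neighbours of the first member among all first members with those of the second member among all second members.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_indices(vectors, i, k):
    """Indices of the k vectors most cosine-similar to vectors[i], excluding i."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
    return set(others[:k])


def neighbor_overlap(side_a, side_b, k):
    """Mean fraction of shared k-nearest neighbours between paired embedding sets.

    side_a[i] and side_b[i] are the embeddings of the two members of pair i.
    """
    if len(side_a) != len(side_b) or k < 1 or k >= len(side_a):
        raise ValueError("need equally sized sets and 1 <= k < number of pairs")
    shared = [
        len(knn_indices(side_a, i, k) & knn_indices(side_b, i, k)) / k
        for i in range(len(side_a))
    ]
    return sum(shared) / len(shared)
```

A value near 1 means both sides of each pair see the same local neighbourhood, i.e. local structure is retained across the pairing.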

2605.22201 2026-05-22 cs.CV

Zero-Shot Temporal Action Localization Through Textual Guidance

Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci

Affiliations * University of Trento; Fondazione Bruno Kessler

AI summary This paper proposes TEGU, which uses rich textual information derived from large language models and structured text extracted from captions to address the difficulty of fine-grained action classification in zero-shot temporal action localization without training supervision. Experiments on THUMOS14 and ActivityNet-v1.3 show that it outperforms existing training-free methods.

Comments Accepted to FG 2026

Abstract

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.

2605.22200 2026-05-22 cs.CV cs.AI cs.LG

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

Hanna Hoffmann, Setareh Bady, Claas de Boer, Max Kirchner, Jan Egger, Rainer Röhrig, Frank Hölzle, Lennart Johannes Gruber, Kunpeng Xie, Marlon Neuhaus, Victor Alves, Guilherme Barbosa, Leonardo Barroso, João Carvalho, Hao Chen, Gabriella d'Albenzio, André Ferreira, Nuno Gomes, Yuichiro Hayashi, Kousuke Hirasawa, Rebecca Hisey, Seungjae Hong, Seoi Jeong, Tiago Jesus, Daehong Kang, Satoshi Kasai, Shunsuke Kikuchi, Takayuki Kitasaka, Satoshi Kondo, Hyoun-Joong Kong, Youngbin Kong, Atsushi Kouno, Shlomi Laufer, Kyu Eun Lee, Bining Long, Nooshin Maghsoodi, Hiroki Matsuzaki, Evangelos Mazomenos, Ori Meiraz, Kensaku Mori, Marina Music, Masahiro Oda, Roi Papo, Jieun Park, Rafael Piexoto, Saeid Rezaei, Mariana Ribeiro, Soyeon Shin, Yang Shu, Idan Smoller, Danail Stoyanov, Yihui Wang, Xinkai Zhao, Sebastian Bodenstedt, Isabel Funke, Stefanie Speidel, Behrus Hinrichs-Puladi

Affiliations * Department of Translational Surgical Oncology, National Center for Tumor Diseases (NCT/UCC) Dresden; The Centre for Tactile Internet with Human-in-the-Loop (CeTI), TUD Dresden University of Technology; Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen; Center for Tooth-, Mouth- and Jaw Medicine, University Göttingen; Institute of Medical Informatics, University Hospital RWTH Aachen; Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology; German Cancer Research Center (DKFZ); Muroran Institute of Technology; Niigata University of Health and Welfare; Konica Minolta, Inc.; Jmees, Inc.; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; Center Algoritmi/LASI, University of Minho; Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho; ICVS/3B's - PT Government Associate Laboratory; Institute for AI in Medicine (IKIM), University Medicine Essen; The Faculty of Data and Decisions Science, Technion - Israel Institute of Technology; UCL Hawkes Institute, University College London; School of Computing, Queen's University; Department of Transdisciplinary Medicine, Seoul National University Hospital; Interdisciplinary Program in Medical Informatics, Seoul National University; Department of Clinical Medical Sciences, Seoul National University; Institute of Convergence Medicine with Innovative Technology, Seoul National University Hospital; Department of Surgery, Seoul National University College of Medicine and Seoul National University Hospital

AI summary This paper presents the OSS challenge, which aims to advance open-surgery skill training through vision-based assessment. Using the challenge dataset and a multi-task evaluation, it benchmarks different approaches to open surgical skill assessment and highlights both the promise and the limitations of video-based evaluation.

Comments Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA

Abstract

Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

2605.22195 2026-05-22 cs.LG

Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

Manuel Noah Riesen, Peter Alfred von Niederhäusern

Affiliations * School of Engineering and Computer Science, Bern University of Applied Sciences

AI summary This paper proposes Reinforced Graph of Thoughts (RGoT), which uses reinforcement learning to automatically generate graphs of operations that adapt to task complexity, with the aim of improving prompting for large language models.

Comments 26 pages (including appendix), 16 figures

Abstract

Graph of Thoughts (GoT), a generalized form of recent prompting paradigms for large language models (LLMs), has been shown to be useful for elaborate problem solving. By executing a graph of operations, thoughts of the LLM are structured as an arbitrary graph, forming the actual graph of thoughts. Originally, the graph of operations is defined manually, which requires in-depth knowledge about the solution of the problem to solve. Such a static graph of operations is rigid and therefore lacks adaptability. We propose Reinforced Graph of Thoughts (RGoT), an automated approach to the GoT prompting paradigm that leverages reinforcement learning (RL) to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way.

2605.22192 2026-05-22 cs.CV

Ultra-High-Definition Image Quality Assessment via Graph Representation Learning

Shaode Yu, Enqi Chen, Ming Huang, Xuemin Ren, Songnan Zhao, Zhicheng Zhang, Qiurui Sun

Affiliations * 1 School of Information Communication Engineering, Communication University of China, Beijing 100024, China; 2 College of Engineering, Northeastern University, Silicon Valley, San Jose, CA 95113, USA; 3 JancsiLab, JancsiTech, Hongkong 999077, China; 4 Center of Information & Network Technology, Beijing Normal University, Beijing 100875, China

AI summary This paper proposes UHD-GCN-BIQA, a graph representation learning framework that improves blind quality assessment of ultra-high-definition images by explicitly modeling structural dependencies among sampled image regions, achieving competitive correlation and the lowest RMSE among the compared methods on the UHD-IQA benchmark.

Abstract

Blind image quality assessment (BIQA) for ultra-high-definition (UHD) images remains challenging because native-resolution inference is computationally expensive, whereas aggressive resizing or isolated cropping may suppress scale-sensitive distortions and weaken the relationship between local artifacts and global scene context. This paper aims to improve UHD-BIQA by explicitly modeling the structural dependencies among sampled image regions rather than treating them as independent views. To this end, we propose UHD-GCN-BIQA, a graph representation learning framework. The framework samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and constructs a hybrid k-nearest-neighbor graph using spatial proximity and feature similarity. Residual graph convolution is used to propagate contextual information across regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. An exponential-moving-average-normalized multi-objective loss function is adopted to stabilize the joint optimization of regression, correlation, and ranking objectives. Experiments on the UHD-IQA benchmark show that UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, obtaining competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based region relation modeling is effective for UHD image quality assessment, particularly for improving absolute quality score estimation under high-resolution visual content.
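
The hybrid k-nearest-neighbor graph described above can be sketched in a few lines of Python. The abstract does not say how spatial proximity and feature similarity are combined, so this sketch assumes the simplest option, taking the union of a spatial k-NN graph over patch centers and a cosine k-NN graph over patch features. All function names here are hypothetical.

```python
import math


def k_nearest(items, i, k, distance):
    """Indices of the k items closest to items[i] under the given distance."""
    others = [j for j in range(len(items)) if j != i]
    others.sort(key=lambda j: distance(items[i], items[j]))
    return others[:k]


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return 1.0 - dot / norm if norm else 1.0


def hybrid_knn_edges(centers, features, k):
    """Undirected edge set: union of spatial k-NN and feature-space k-NN."""
    edges = set()
    for i in range(len(centers)):
        neighbours = set(k_nearest(centers, i, k, math.dist))
        neighbours |= set(k_nearest(features, i, k, cosine_distance))
        for j in neighbours:
            edges.add((min(i, j), max(i, j)))
    return edges
```

With such a graph, a patch is connected both to its spatial neighbours and to distant patches that look similar, which gives the graph convolution a route for carrying context between local artifacts and the wider scene.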

2605.22191 2026-05-22 cs.LG cs.IT math.IT

Bandit Convex Optimization with Gradient Prediction Adaptivity

Shuche Wang, Adarsh Barik, Vincent Y. F. Tan

Affiliations * Department of Mathematics, National University of Singapore, Singapore; Department of Computer Science and Engineering, Indian Institute of Technology Delhi, India; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

AI summary This paper studies whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. For the two-point feedback setting it proposes Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT), whose gradient estimator has variance that scales with the prediction error, yielding a regret bound of $O(\sqrt{d\,\mathbb{E}[S_T]})$. An information-theoretic lower bound shows that the algorithm is optimal up to a factor of $\sqrt{d}$.

Abstract

Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2$. We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$, where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt{d}$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
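
Reading the two bounds side by side makes the optimality claim concrete. In the sketch below, $R_T$ (regret of TP-VR-OPT), $R_T^{\star}$ (best achievable prediction-adaptive regret), and the gradient bound $G$ are symbols introduced here for illustration; the abstract names none of them, and the last line additionally assumes $\|\nabla f_t(x_t)\|\le G$ for all $t$.

```latex
\begin{align*}
  R_T &= O\!\left(\sqrt{d\,\mathbb{E}[S_T]}\right),
  &
  R_T^{\star} &= \Omega\!\left(\sqrt{\mathbb{E}[S_T]}\right),
  \\[4pt]
  \frac{\sqrt{d\,\mathbb{E}[S_T]}}{\sqrt{\mathbb{E}[S_T]}} &= \sqrt{d}
  && \text{(gap between upper and lower bound)},
  \\[4pt]
  m_t = 0 \;\Rightarrow\; S_T &= \sum_{t=1}^{T}\bigl\|\nabla f_t(x_t)\bigr\|^{2} \le G^{2}T
  \;\Rightarrow\; R_T = O\!\left(G\sqrt{dT}\right)
  && \text{(uninformative predictions)}.
\end{align*}
```

So with no useful predictions the bound falls back to a $\sqrt{T}$-type rate, while accurate predictions (small $S_T$) shrink it accordingly.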

2605.22190 2026-05-22 cs.CV

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

Matteo Balice, Yanik Kunzi, Chenyangguang Zhang, Matteo Matteucci, Marc Pollefeys, Sungwhan Hong

Affiliations * Politecnico di Milano; ETH Zürich; ETH AI Center

AI summary This paper proposes NoPo4D, presented as the first pose-free feed-forward system that jointly handles dynamic content, multi-view input, and unknown camera poses. It relies on a velocity decomposition and a bidirectional motion encoder, and it outperforms prior feed-forward baselines.

Comments https://bralani.github.io/nopo4d_html/

Abstract

Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.

2605.22189 2026-05-22 cs.RO

Learning A Unified Risk Map for Autonomous Driving in Partially Observable Environments

Jie Jia, Yaofeng Su, Zeyu Bao, Yun Hong, Bingzhao Gao, Zhongxue Gan, Wenchao Ding

Affiliations * Fudan University; Tongji University

AI summary This paper proposes a unified risk map modeling and learning framework for autonomous driving in partially observable environments. It integrates traffic flow risk and collision risk through spatiotemporal modeling for fine-grained assessment of occlusion-induced hazards, and introduces a diffusion-based scenario generation framework to address the scarcity of occluded-interaction scenarios. Experiments on the Waymo Open Motion Dataset show that it significantly outperforms the state-of-the-art occlusion-aware baseline.

Comments Published in IEEE Robotics and Automation Letters

Abstract

Occlusion-aware prediction remains a critical challenge in autonomous driving due to the inherent uncertainty of unobserved regions. Existing approaches either overestimate risk based on reachable states or struggle to predict accurate trajectories under high occlusion uncertainty. To address these limitations, we propose a unified risk map modeling and learning framework for partially observable environments. Our method integrates traffic flow risk and collision risk through spatiotemporal modeling, enabling fine-grained assessment of occlusion-induced hazards. To address the scarcity of scenarios involving occluded interactions, we introduce a diffusion-based scenario generation framework that produces realistic yet adversarial scenarios. We integrate the modeling and learning of a unified risk map into a framework that supports risk-aware planning under partial observability. Experiments on the Waymo Open Motion Dataset show that our method significantly outperforms the state-of-the-art occlusion-aware baseline, improving minimum time-to-collision by 0.78 times and average time-to-collision by 1.67 times. The proposed framework offers a comprehensive and practical solution for risk-aware planning in partially observable environments.

2605.22188 2026-05-22 cs.LG math.OC stat.ML

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse Generalized Linear Models

Jiachang Liu, Andrea Lodi

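Affiliations * Jacobs Technion-Cornell Institute, Cornell Tech and Technion–IIT

AI summary: This paper proposes a CPU-GPU framework that processes branch-and-bound nodes in batches on the GPU, substantially accelerating optimization problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions of cardinality-constrained generalized linear models.

Details
AI abstract (translated from Chinese)

GPUs have markedly accelerated first-order methods for large-scale optimization, especially continuous optimization. This success, however, has not carried over smoothly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions of cardinality-constrained generalized linear models. The main challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU-GPU framework that processes multiple BnB nodes in batches on the GPU. The framework is built around a set of GPU-efficient subroutines and uses padding and lightweight custom kernels to handle irregular node data structures. Experiments show that the framework achieves speedups of one to two orders of magnitude on challenging instances and reaches a zero optimality gap. It can also be extended to collect the entire Rashomon set, enabling downstream statistical analyses such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).

English abstract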

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU-GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).
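
The abstract mentions padding as the way to batch irregular BnB node data. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes each node is described by a variable-length list of free feature indices, and the names `pad_node_batch`, `supports`, and `pad_index` are hypothetical.

```python
from typing import List, Tuple


def pad_node_batch(
    supports: List[List[int]], pad_index: int = -1
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Pad variable-length index lists to a common width.

    Returns the padded lists plus a boolean mask that marks real entries,
    so a batched kernel can ignore the padding slots.
    """
    width = max((len(s) for s in supports), default=0)
    padded: List[List[int]] = []
    mask: List[List[bool]] = []
    for s in supports:
        fill = width - len(s)
        padded.append(list(s) + [pad_index] * fill)
        mask.append([True] * len(s) + [False] * fill)
    return padded, mask


# Three hypothetical BnB nodes, each with a different number of free features.
nodes = [[0, 3, 4], [1], [2, 5]]
padded, mask = pad_node_batch(nodes)
```

After padding, every node occupies a row of the same width, so the batch can be stored as one rectangular array and processed in a single GPU call; the mask keeps padded slots from contributing to the result.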

2605.22186 2026-05-22 cs.CV

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Mingyang Huang, Xueyang Fu, Zheng-Jun Zha

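Affiliations * University of Science and Technology of China

AI summary: This paper proposes the EIC-LIE framework, which uses an event-illumination collaborative interaction module and an illumination-aware event filter to address two gaps in event-based low-light image enhancement: the neglect of global image illumination and the noise sensitivity of real-world event signals. It also builds the first high-resolution, real-world event-based dataset, and experiments show that it outperforms existing methods on several datasets.

Details
AI abstract (translated from Chinese)

Event-based low-light image enhancement (LIE) methods focus mainly on incorporating high dynamic range (HDR) information while overlooking global illumination in images and the inherent noise sensitivity of event signals in real-world scenes. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Specifically, we first design an Event-Illumination Collaborative Interaction (EICI) module with two key processes: forward gathering, which collects HDR features under varying lighting conditions, and backward injection, which supplies complementary content to the illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise according to brightness statistics derived from the image. In addition, we build a beam-splitter-based hybrid imaging system to collect high-quality, temporally synchronized event-image pairs from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that EIC-LIE outperforms existing methods on five real-world and synthetic datasets, with gains of up to 1.24 dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.

English abstract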

Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.

2605.22185 2026-05-22 cs.CV cs.LG

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari

发表机构 * Verizon Connect

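AI summary: This study fuses downsampled video frames with synchronized high-frequency telematics data and semantic information from specialized computer vision models to improve the perception and reasoning of multimodal large language models in safety-critical driving scenarios, so that they identify and describe safety-critical events in real-world driving more accurately.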

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

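Details
AI abstract (translated from Chinese)

In recent years, multimodal large language models (MLLMs) have shown impressive performance in general visual understanding. Their application to safety-critical driving scenarios, however, is limited by an inability to accurately perceive and reason about rare, high-stakes dynamic events such as collisions or near-collisions. To address this, we propose a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic information from specialized computer vision models. The pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, designed specifically to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We demonstrate the approach by fine-tuning the open-source QwenVL-2.5 model with DoRA adapters: experiments show significantly improved identification and explanation of safety-critical events with fewer than 50M trainable parameters and a limited compute budget.

English abstract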

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.
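
For readers unfamiliar with the DoRA adapters mentioned above, the block below sketches the standard weight-decomposed low-rank adaptation formula. The symbols $W_0$, $m$, $B$, $A$, and $r$ do not appear in the abstract; they follow the usual DoRA notation, and the abstract does not say which layers or ranks the authors used.

```latex
% Pretrained weight W_0 \in \mathbb{R}^{d \times k} is frozen.
% Trainable: magnitude m \in \mathbb{R}^{1 \times k} and low-rank factors B, A.
W' \;=\; m \,\odot\, \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c},
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
% \lVert \cdot \rVert_c is the column-wise vector norm;
% m is initialized to \lVert W_0 \rVert_c and broadcast across rows.
```

Because only $m$, $B$, and $A$ are trained, the number of trainable parameters stays small relative to the full model, which is consistent with the reported budget of fewer than 50M trainable parameters.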

2605.22182 2026-05-22 cs.LG

IKNO: Infinite-order Kernel Neural Operators

IKNO: Infinite-order Kernel Neural Operators

Pengyuan Zhu, Ivor W. Tsang, Yueming Lyu

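Affiliations * Nanyang Technological University; Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR)

AI summary: This paper proposes IKNO, a neural operator built from infinite-order kernel integrals, which addresses the limited expressivity of existing models that rely on first-order kernel integrals. Two complementary constructions provide efficient global information aggregation, and the method achieves SOTA accuracy on multiple benchmark datasets.

Details
AI abstract (translated from Chinese)

Neural operators have achieved significant success in modern scientific computing thanks to their flexibility and strong generalization. Existing models, however, rely mainly on first-order kernel integral approximations, which severely limits their expressivity. To address this, we propose the Infinite-order Kernel Neural Operator (IKNO), which constructs neural operators via infinite-order kernel integrals and admits an elegant closed-form finite approximation. We develop two complementary infinite-order constructions: IKNO-Vanilla, which applies the full-kernel resolvent on the product grid via Kronecker eigendecomposition, and IKNO-TP, an alternative tensor-product operator composed from per-axis resolvents. We also develop fast computation schemes for both IKNO variants, achieving strong global information aggregation while remaining computationally efficient. We evaluate IKNO on time-dependent and time-independent benchmark datasets with arbitrary input shapes, including large-scale industrial datasets. Extensive experiments show that IKNO achieves significant accuracy gains on nearly all benchmark datasets while remaining scalable to very large point clouds.

English abstract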

Neural operators have achieved significant success in modern scientific computing due to their flexibility and strong generalization capabilities. Existing models, however, primarily rely on first-order kernel integral approximations, which severely limit their expressivity. To address this, we propose the Infinite-order Kernel Neural Operator (IKNO), which constructs neural operators via infinite-order kernel integrals and admits an elegant closed-form finite approximation. We develop two complementary infinite-order neural operator constructions: IKNO-Vanilla, which applies the full-kernel resolvent on the product grid via Kronecker eigendecomposition, and IKNO-TP, an alternative tensor-product operator that composes per-axis resolvents. Furthermore, we develop fast computation schemes for both variants of IKNO, which achieve outstanding global information aggregation while maintaining high computational efficiency. Empirically, we evaluate our IKNO on both time-dependent and time-independent benchmarks with arbitrary input shapes, including large-scale industrial datasets. Extensive experiments demonstrate that the IKNO method consistently achieves the SOTA accuracy with significant improvements on nearly all benchmark datasets while maintaining scalability to very large point clouds.
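
The abstract does not define "infinite-order kernel integral" or "resolvent". One common reading, given here as an assumption rather than as the paper's definition, is a Neumann series of repeated kernel applications that sums to a resolvent in closed form. The symbols $K$, $K_1$, $K_2$, $U_i$, and $\Lambda_i$ are illustrative and assume diagonalizable per-axis kernels.

```latex
% Summing all orders of a discretized kernel operator K (spectral radius < 1):
\sum_{n=0}^{\infty} K^{n} \;=\; (I - K)^{-1}, \qquad \rho(K) < 1 .

% If K factorizes over a product grid, K = K_1 \otimes K_2,
% with K_i = U_i \Lambda_i U_i^{-1}, then
(I - K_1 \otimes K_2)^{-1}
  \;=\; (U_1 \otimes U_2)\,
        \bigl(I - \Lambda_1 \otimes \Lambda_2\bigr)^{-1}
        (U_1 \otimes U_2)^{-1} .
```

Under this reading, the middle factor is diagonal, so the inverse on the full product grid reduces to per-axis eigendecompositions plus an elementwise division, which would explain how a full-kernel resolvent can remain tractable.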