arXivDaily: arXiv daily academic digest, updated Monday through Friday
2506.01732 2026-05-20 cs.CL

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzystof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov

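Affiliations: PleIAs

AI summary: This paper introduces Common Corpus, which it describes as the largest open dataset for LLM pre-training. The dataset consists of data that is either uncopyrighted or openly licensed, covers many languages and domains, and supports multilingual pre-training.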

Journal ref ICLR 2026 (Oral)

Abstract

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under open licenses, totaling about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs across diverse areas of knowledge. In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.

2505.16819 2026-05-20 cs.CV

Character-Centered Dialogue Generation from Scene-Level Prompts

Taewon Kang, Ming C. Lin

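Affiliations: University of Maryland at College Park, United States

AI summary: This work proposes a modular pipeline that turns action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model in generating dialogue. A Recursive Narrative Bank maintains contextual and emotional consistency across scenes, and each utterance is finally rendered as expressive, character-conditioned speech to produce a complete audiovisual narrative.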

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 18 pages, 5 figures

Abstract

Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and emotional consistency across scenes, we introduce a Recursive Narrative Bank, a speaker-aware, temporally structured memory that accumulates each character's dialogue history. Inspired by Script Theory, this design enables dialogue that reflects evolving goals, social context, and narrative roles. Finally, we render each utterance as expressive, character-conditioned speech, producing fully voiced, multimodal video narratives. Our training-free framework generalizes across diverse story settings, providing a scalable solution for coherent, character-grounded audiovisual storytelling.
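
As a rough illustration of the speaker-aware, temporally ordered memory described above, the sketch below stores each character's utterances in scene order and formats the most recent ones as prompt context. The class name `NarrativeBank`, its methods, and the prompt format are hypothetical; the abstract does not specify an interface, so this is only one plausible shape for such a memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class NarrativeBank:
    """Toy speaker-aware memory; every name here is an illustrative assumption."""

    # Maps speaker -> list of (scene_index, utterance), kept in insertion order.
    _turns: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, scene_index: int, speaker: str, utterance: str) -> None:
        """Record one utterance for a speaker in a given scene."""
        self._turns[speaker].append((scene_index, utterance))

    def history(self, speaker: str, max_turns: int = 3) -> list:
        """Return up to max_turns of the speaker's most recent turns, oldest first."""
        if max_turns <= 0:
            return []
        return self._turns[speaker][-max_turns:]

    def as_prompt_context(self, speaker: str, max_turns: int = 3) -> str:
        """Format recent turns as text that could be prepended to an LLM prompt."""
        turns = self.history(speaker, max_turns)
        return "\n".join(f"[scene {s}] {speaker}: {u}" for s, u in turns)


bank = NarrativeBank()
bank.add(0, "Mira", "We should leave before dark.")
bank.add(1, "Joss", "The bridge is out.")
bank.add(2, "Mira", "Then we take the ridge path.")
context = bank.as_prompt_context("Mira")
```

Keeping one list per speaker means a character's earlier lines can be retrieved without scanning the whole story, which is the property the abstract attributes to a speaker-aware memory.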

2505.11628 2026-05-20 cs.CL cs.LG

Critique-Guided Distillation for Robust Reasoning via Refinement

Berkcan Kapusuzoglu, Supriyo Chakraborty, Zain Sarwar, Chia-Hsuan Lee, Sambit Sahu

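Affiliations: University of Chicago, Department of Computer Science

AI summary: This work proposes Critique-Guided Distillation, which decouples critique consumption from critique generation: during fine-tuning the model refines flawed responses based on teacher critiques, which improves its reasoning. On mathematical reasoning benchmarks it outperforms standard distillation and Critique Fine-Tuning.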

Comments Accepted to ICML 2026

Abstract

Supervised fine-tuning with expert demonstrations often produces models that imitate outputs without internalizing the reasoning processes needed for robust generalization. While critique-based approaches show promise, training models to generate critiques directly, such as Critique Fine-Tuning (CFT), can lead to output-format drift and degradation of general capabilities. We propose Critique-Guided Distillation (CGD), a training framework that decouples critique consumption from critique generation. During fine-tuning, the student is trained to refine flawed responses conditioned on teacher critiques. CGD treats critiques as a training-time-only supervision signal, encouraging internalization of error-aware reasoning: critiques guide learning but are absent at inference. Controlled ablations confirm that these reasoning gains are directly driven by the specificity and relevance of the teacher's feedback. Across five model families, CGD consistently outperforms CFT and standard distillation on mathematical reasoning benchmarks, yielding 7% average improvements and gains of up to +15.0% on AMC23 and +12.2% on MATH-500. On challenging competition problems such as AIME24 and AIME25, CGD achieves substantially higher Pass@1 and stronger performance at low Pass@k, indicating improved reasoning quality per sample. Importantly, CGD preserves general instruction-following capabilities where CFT degrades significantly (-21.3% on IFEval). These results position CGD as a practical and compute-efficient intermediate training paradigm for reasoning-centric tasks without introducing architectural inference-time overhead.
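
The split between training-time and inference-time inputs can be sketched as follows. This is a minimal illustration, assuming a plain-text prompt template; the function names, field labels, and template wording are hypothetical and are not taken from the paper. The only property it is meant to show is that the critique appears in the training input and is absent at inference.

```python
def build_training_example(question: str, flawed: str, critique: str, refined: str) -> dict:
    """Training input contains the flawed draft and the teacher critique;
    the target is the refined answer (template is an assumption)."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Draft answer:\n{flawed}\n\n"
        f"Critique:\n{critique}\n\n"
        "Refined answer:\n"
    )
    return {"input": prompt, "target": refined}


def build_inference_input(question: str) -> str:
    """At inference no critique is supplied; the model sees only the question."""
    return f"Question:\n{question}\n\nAnswer:\n"


example = build_training_example(
    question="What is 12 * 13?",
    flawed="12 * 13 = 146",
    critique="12 * 13 = 12 * 10 + 12 * 3 = 120 + 36, which is not 146.",
    refined="12 * 13 = 120 + 36 = 156",
)
inference_prompt = build_inference_input("What is 12 * 13?")
```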

2504.05454 2026-05-20 cs.LG cs.AI cs.CE q-bio.GN q-bio.QM

GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction

Yoshitaka Inoue, Tianfan Fu, Augustin Luna

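Affiliations: Computational Biology Branch, National Library of Medicine; Developmental Therapeutics Branch, National Cancer Institute

AI summary: This paper proposes GraphPINE, a graph neural network architecture that uses domain-specific prior knowledge to initialize node importance, with the aim of making drug response prediction more interpretable. Its importance propagation layer updates the feature matrix and node importance together and propagates feature values through GNN-based graph propagation, which supports better-informed feature learning and graph representations.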

Abstract

Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradient, and Shapley value. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture leveraging domain-specific prior knowledge to initialize node importance optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained for gradient and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that unifies 1) updates for feature matrix and node importance and 2) uses GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes included in a gene-gene graph with a drug-target interaction (DTI) graph for initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by article count discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.

2503.11615 2026-05-20 cs.LG math.OC

From Score Matching to Diffusion: A Fine-Grained Error Analysis in the Gaussian Setting

Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré

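Affiliations: ENS Paris, PSL, CNRS; Univ. Paris-Saclay, Inria, CEA

AI summary: This paper studies the sampling error of diffusion samplers in the Gaussian setting. It analyzes four main error sources across score matching and the diffusion process, and shows how the anisotropy of the data distribution interacts with key parameters of the end-to-end sampling method.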

Abstract

Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first, estimating the score function (the gradient of a smoothed log-distribution) and then applying a diffusion-based sampling algorithm -- such as Langevin or Diffusion models. The resulting distribution's correctness can be impacted by four major factors: the generalization and optimization errors in score matching, and the discretization and minimal noise amplitude in the diffusion. In this paper, we make the sampling error explicit when using a diffusion sampler in the Gaussian setting. We provide a sharp analysis of the Wasserstein sampling error that arises from these four error sources. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the number of initial samples, the stepsizes in both score matching and diffusion, and the noise amplitude. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy.
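
To make one of the four error sources concrete, the sketch below isolates the discretization error in the simplest possible case: a one-dimensional Gaussian target N(mu, sigma2) sampled with unadjusted Langevin steps x <- x - h*(x - mu)/sigma2 + sqrt(2h)*z using the exact score. This is a simplified stand-in, not the paper's analysis: it ignores score-estimation error and noise schedules, and the function names are chosen here for illustration. In this case the variance of the iterates follows a linear recursion with a closed-form fixed point, so the Wasserstein bias can be computed without random sampling.

```python
import math


def ula_variance_step(v: float, sigma2: float, h: float) -> float:
    """One step of the variance recursion v <- (1 - h/sigma2)^2 * v + 2h."""
    a = 1.0 - h / sigma2
    return a * a * v + 2.0 * h


def ula_stationary_variance(sigma2: float, h: float) -> float:
    """Fixed point of the recursion above: sigma2 / (1 - h / (2*sigma2))."""
    if not 0.0 < h < 2.0 * sigma2:
        raise ValueError("step size must satisfy 0 < h < 2*sigma2")
    return sigma2 / (1.0 - h / (2.0 * sigma2))


def w2_gaussian_1d(m1: float, s1: float, m2: float, s2: float) -> float:
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


sigma2, h = 1.0, 0.1
v = 0.0  # start from a point mass at the mean
for _ in range(500):
    v = ula_variance_step(v, sigma2, h)

v_inf = ula_stationary_variance(sigma2, h)
bias = w2_gaussian_1d(0.0, math.sqrt(sigma2), 0.0, math.sqrt(v_inf))
```

The stationary variance exceeds sigma2 for any positive step size, so the bias shrinks only as h goes to zero; this is the kind of step-size trade-off the abstract refers to.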

2503.08633 2026-05-20 cs.LG

How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

Gal Alon, Yehuda Dar

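Affiliations: Faculty of Computer and Information Science

AI summary: This paper studies how the model parameterization level (i.e., network width) affects machine unlearning in deep neural networks. It compares unlearning methods across parameterization levels, unlearning goals (privacy or bias removal), and whether the method explicitly uses the unlearned examples. Unlearning on overparameterized models performs better for both privacy and bias removal, at the cost of some degradation in generalization.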

Abstract

Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning usually excels on overparameterized models by significantly improving privacy/bias at a reasonable cost of utility (generalization) degradation; although for bias removal this requires the unlearning method to use the unlearned examples. Furthermore, we measure how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.

2503.06310 2026-05-20 cs.CV

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

Taewon Kang, Divya Kothandaraman, Ming C. Lin

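Affiliations: University of Maryland at College Park; Dolby Laboratories

AI summary: This paper proposes a storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing, addressing temporal coherence, semantic consistency, and scene-action continuity in text-to-video generation. Three key components together yield more coherent video narratives.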

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026). 13 pages, 4 figures

Abstract

Generating coherent long-form video sequences from discrete text prompts remains challenging due to difficulties in maintaining temporal coherence, semantic consistency, and scene-action continuity across segments. We propose a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. Our approach combines three key components: (i) a bidirectional time-weighted latent blending strategy that enforces temporal consistency between consecutive video segments, (ii) a dynamics-informed prompt weighting (DIPW) mechanism that adaptively balances scene and action prompts at each diffusion timestep based on CLIP-based alignment, narrative progression, and temporal smoothness, and (iii) a semantic action representation that encodes high-level action semantics to modulate transitions according to action similarity. Latent-space blending preserves spatial coherence within scenes, while time-weighted blending introduces bidirectional temporal constraints to prevent abrupt transitions. Together, these components enable fluid and coherent video narratives that faithfully reflect both scene context and action dynamics. Extensive experiments demonstrate that our method significantly outperforms baselines, producing temporally consistent and visually compelling long-form videos without any additional training, thereby bridging the gap between short clips and extended text-driven video storytelling.
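
The idea of time-weighted blending across a segment boundary can be illustrated with a plain linear cross-fade over an overlap window. This is a swapped-in simplification: the abstract does not give the paper's bidirectional weighting, so the linear schedule, the function name `blend_overlap`, and the list-of-floats latent representation are all assumptions made for this sketch.

```python
def blend_overlap(prev_latents: list, next_latents: list) -> list:
    """Cross-fade two equally long lists of latent vectors.

    The weight on the next segment grows linearly from 0 (first overlap frame)
    to 1 (last overlap frame); this schedule is an illustrative assumption.
    """
    if not prev_latents or len(prev_latents) != len(next_latents):
        raise ValueError("overlap windows must be non-empty and equally long")
    n = len(prev_latents)
    blended = []
    for t, (p, q) in enumerate(zip(prev_latents, next_latents)):
        w = t / (n - 1) if n > 1 else 0.5
        blended.append([(1.0 - w) * a + w * b for a, b in zip(p, q)])
    return blended


prev_tail = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # last frames of segment k
next_head = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]  # first frames of segment k+1
blended = blend_overlap(prev_tail, next_head)
```

The first blended frame matches the outgoing segment and the last matches the incoming one, which is what prevents an abrupt jump at the boundary.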

2410.20238 2026-05-20 cs.CL cs.AI

A Survey of Large Language Models for Arabic Language and its Dialects

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

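Affiliations: iWAN Research Group, College of Computer and Information Sciences, King Saud University

AI summary: This survey covers large language models designed for Arabic and its dialects, including key architectures, pre-training datasets, and the performance of monolingual, bilingual, and multilingual models on downstream tasks. It also discusses the openness of Arabic LLMs and the challenges and opportunities for future research.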

Comments Submitted to ACM Transactions on Asian and Low-Resource Language Information Processing

Abstract

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

2408.06843 2026-05-20 cs.RO

Learn2Decompose: Learning Problem Decomposition for Efficient Sequential Multi-object Manipulation Planning

Yan Zhang, Teng Xue, Amirreza Razmjoo, Sylvain Calinon

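Affiliations: Idiap Research Institute; Ecole Polytechnique Fédérale de Lausanne

AI summary: This paper proposes an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. It accelerates TAMP solvers by learning problem decompositions from demonstrations; the core components are goal decomposition learning, computational distance learning, and object reduction, which together improve replanning efficiency.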

Comments Extension of RAL version: added PR2 Whole-body kitchen task and detailed discussion on limitations in main text; added pseudocode and robustness analysis of our approach, and formal analysis on why and when task goals are decomposable in appendix

Abstract

We present an efficient task and motion replanning approach for sequential multi-object manipulation in dynamic environments. Conventional Task And Motion Planning (TAMP) solvers experience an exponential increase in planning time as the planning horizon and number of objects grow, limiting their applicability in real-world scenarios. To address this, we propose learning problem decompositions from demonstrations to accelerate TAMP solvers. Our approach consists of three key components: goal decomposition learning, computational distance learning, and object reduction. Goal decomposition identifies the necessary sequences of states that the system must pass through before reaching the final goal, treating them as subgoal sequences. Computational distance learning predicts the computational complexity between two states, enabling the system to identify the temporally closest subgoal from a disturbed state. Object reduction minimizes the set of active objects considered during replanning, further improving efficiency. We evaluate our approach on three benchmarks, demonstrating its effectiveness in improving replanning efficiency for sequential multi-object manipulation tasks in dynamic environments.
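
The replanning step described above, picking the subgoal that is cheapest to reach from a disturbed state, can be sketched as follows. The state representation (a set of true facts), the function names, and especially `toy_cost` are hypothetical: `toy_cost` is a hand-written stand-in for the learned computational-distance predictor, which the abstract does not specify.

```python
from typing import Callable, FrozenSet, List, Sequence

State = FrozenSet[str]


def toy_cost(state: State, goal: State) -> int:
    """Stand-in for a learned distance predictor: number of differing facts."""
    return len(state ^ goal)


def closest_subgoal(state: State, subgoals: Sequence[State],
                    predict_cost: Callable[[State, State], float]) -> int:
    """Index of the subgoal with the lowest predicted cost (first one on ties)."""
    if not subgoals:
        raise ValueError("need at least one subgoal")
    costs = [predict_cost(state, g) for g in subgoals]
    return costs.index(min(costs))


def remaining_subgoals(state: State, subgoals: Sequence[State],
                       predict_cost: Callable[[State, State], float]) -> List[State]:
    """Subgoal sequence to resume from after a disturbance."""
    return list(subgoals[closest_subgoal(state, subgoals, predict_cost):])


subgoals = [
    frozenset({"cup_on_tray"}),
    frozenset({"cup_on_tray", "plate_on_tray"}),
    frozenset({"cup_on_tray", "plate_on_tray", "fork_on_tray"}),
]
disturbed = frozenset({"cup_on_tray", "plate_on_tray"})
resume_index = closest_subgoal(disturbed, subgoals, toy_cost)
plan_tail = remaining_subgoals(disturbed, subgoals, toy_cost)
```

Resuming from the nearest subgoal rather than replanning to the final goal keeps each solver call short, which is where the abstract locates the efficiency gain.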

2407.13193 2026-05-20 cs.CL

Retrieval-Augmented Generation for Natural Language Processing: A Survey

Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

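Affiliations: City University of Hong Kong; Mohamed bin Zayed University of Artificial Intelligence; McGill University, Mila; National Taiwan University

AI summary: This survey reviews retrieval-augmented generation (RAG) for natural language processing, focusing on retrievers and retrieval fusion techniques. It proposes a new taxonomy of fusion methods and analyzes RAG applications across NLP tasks, evaluation methods, training paradigms, and the challenges and future directions of industrial deployment.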

Comments Accepted by Artificial Intelligence Review

Abstract

Large language models (LLMs) have achieved strong empirical performance in various fields, benefiting from their huge amount of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and lacking domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge base to augment LLMs, mitigates these limitations. This paper presents a systematic review of RAG techniques for natural language processing (NLP), with a focus on retrievers and retrieval fusions. We introduce a novel taxonomy of retrieval fusions, such as query-based, logits-based, latent, and parametric fusion, and provide structured comparisons across accessibility, efficiency, and use cases. The paper further examines RAG applications across diverse NLP tasks, discusses evaluation methodologies and benchmark limitations, and analyzes training paradigms with and without knowledge base updates. Finally, we explore industrial deployment considerations and identify emerging challenges and future directions, including security, efficiency, and graph-based retrieval.
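
Of the fusion types listed above, query-based fusion is the easiest to show in a few lines: retrieved passages are concatenated with the user query before the text reaches the generator. The sketch below is a toy under stated assumptions: the retriever is a simple token-overlap ranker rather than any method from the survey, and the function names and prompt template are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase alphanumeric tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Toy lexical retriever: rank passages by shared-token count with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def fuse_into_query(query: str, passages: list) -> str:
    """Query-based fusion: prepend retrieved passages to the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


corpus = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "France borders Spain and Italy.",
]
question = "What is the capital of France?"
top = retrieve(question, corpus, k=1)
prompt = fuse_into_query(question, top)
```

Because fusion happens purely in the input text, this variant works with any generator that accepts a prompt, including models reachable only through an API.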

2404.07106 2026-05-20 cs.CV cs.GR

3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion

Yixuan Li, Weidong Yang, Ben Fei

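Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; Department of Information Engineering, The Chinese University of Hong Kong

AI summary: This paper proposes 3DMambaComplete, a point cloud completion network built on the Mamba framework. Its HyperPoint generation, spread, and deformation modules address the loss of local detail and the computational complexity of point cloud completion, and experiments show it outperforms existing methods.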

Comments 24 pages, 14 figures, 10 tables

Abstract

Point cloud completion aims to generate a complete and high-fidelity point cloud from an initially incomplete and low-quality input. A prevalent strategy involves leveraging Transformer-based models to encode global features and facilitate the reconstruction process. However, the adoption of pooling operations to obtain global feature representations often results in the loss of local details within the point cloud. Moreover, the attention mechanism inherent in Transformers introduces additional computational complexity, rendering it challenging to handle long sequences effectively. To address these issues, we propose 3DMambaComplete, a point cloud completion network built on the novel Mamba framework. It comprises three modules: HyperPoint Generation encodes point cloud features using Mamba's selection mechanism and predicts a set of HyperPoints. A specific offset is estimated, and the down-sampled points become HyperPoints. The HyperPoint Spread module disperses these HyperPoints across different spatial locations to avoid concentration. Finally, a deformation method transforms the 2D mesh representation of HyperPoints into a fine-grained 3D structure for point cloud reconstruction. Extensive experiments conducted on various established benchmarks demonstrate that 3DMambaComplete surpasses state-of-the-art point cloud completion methods, as confirmed by qualitative and quantitative analyses.

2605.19762 2026-05-20 cs.AI cs.CL

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen

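Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Individual Researcher; Zhejiang University, Hangzhou, China

AI summary: Through controlled pretraining experiments, this paper studies how code affects reasoning ability. It finds that code mainly improves programming ability rather than general reasoning and competes with knowledge-intensive tasks, especially complex mathematical reasoning, while structured reasoning traces (such as code-text and math-text mixtures) improve reasoning more than purely executable code.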

Comments Accepted by ICML 2026, 22 pages, 10 figures

Abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

2605.19758 2026-05-20 cs.AI cs.DB stat.ML

CogScale: Scalable Benchmark for Sequence Processing

Yannis Bendi-Ouis, Romain de Coudenhove, Xavier Hinaut

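Affiliations: Mnemosyne Inria; ENS PSL; École normale supérieure (Ulm); Bordeaux, France; Paris, France

AI summary: This paper proposes CogScale, a benchmark of 14 scalable synthetic tasks for evaluating the cognitive and memory abilities of different architectures at different parameter scales. Its standardized, lightweight framework is intended to speed up the validation of architectural innovations.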

Abstract

The ability to maintain and manipulate information over time is a fundamental aspect of living beings and Artificial Intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1k, 10k, and 100k) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.
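
To give a sense of what a scalable synthetic task can look like, the sketch below generates a delayed-recall problem whose difficulty is controlled by two parameters: the number of symbols to remember and the length of the delay. The abstract does not list the 14 tasks, so this generator, its name, and its token layout are hypothetical and are not claimed to be part of CogScale.

```python
import random

BLANK, CUE = 0, 1  # reserved token ids; data symbols start at 2


def make_delayed_recall(n_symbols: int, delay: int, vocab_size: int = 8, seed: int = 0):
    """Return (inputs, target) for a toy delayed-recall task.

    inputs = n_symbols random symbols, then `delay` blanks, then one cue token;
    target = the symbols the model should reproduce after the cue.
    """
    rng = random.Random(seed)  # local generator keeps examples reproducible
    symbols = [rng.randrange(2, 2 + vocab_size) for _ in range(n_symbols)]
    inputs = symbols + [BLANK] * delay + [CUE]
    return inputs, symbols


easy_inputs, easy_target = make_delayed_recall(n_symbols=3, delay=2)
hard_inputs, hard_target = make_delayed_recall(n_symbols=20, delay=50, seed=1)
```

Scaling `n_symbols` stresses memory capacity while scaling `delay` stresses retention over time, so the two can be varied independently when comparing small models.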

2605.19752 2026-05-20 cs.LG

MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

Paul Krzakala, Gabriel Melo, Camille Lançon, Charlotte Laclau, Rémi Flamary, Etienne Thévenot, Florence d'Alché-Buc

Affiliations: LTCI, Télécom Paris & CMAP, Ecole Polytechnique, Institut Polytechnique de Paris; LTCI, Télécom Paris, Institut Polytechnique de Paris; CEA, INRAE, MetaboHUB, Université Paris-Saclay

AI summary: This work proposes MSAlign, which uses multimodal alignment to align molecule and mass spectra foundation models and improve the accuracy of metabolite identification. It also addresses the distribution shift introduced by data splitting strategies.

Abstract

Accurately identifying metabolites (i.e., small molecules) from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists in recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train, and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
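
A candidate-based contrastive objective of the kind mentioned above can be sketched as a softmax cross-entropy over the similarities between one spectrum embedding and its candidate molecule embeddings. The abstract does not give the loss formula, so the cosine similarity, the temperature value, and the function names below are assumptions; the embeddings are hard-coded stand-ins for what the projected outputs of the frozen encoders would provide.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_contrastive_loss(spectrum_emb: list, candidate_embs: list,
                               true_index: int, temperature: float = 0.1) -> float:
    """Cross-entropy of the true candidate under a softmax over similarities."""
    logits = [cosine(spectrum_emb, c) / temperature for c in candidate_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[true_index]


def rank_candidates(spectrum_emb: list, candidate_embs: list) -> list:
    """Candidate indices sorted from most to least similar (retrieval step)."""
    return sorted(range(len(candidate_embs)),
                  key=lambda i: cosine(spectrum_emb, candidate_embs[i]),
                  reverse=True)


spectrum = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss_if_first_is_true = candidate_contrastive_loss(spectrum, candidates, true_index=0)
loss_if_last_is_true = candidate_contrastive_loss(spectrum, candidates, true_index=2)
ranking = rank_candidates(spectrum, candidates)
```

Normalizing over the candidate set, rather than over a random batch, matches the retrieval setting in which the model must pick the right molecule from a given list.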

2605.19750 2026-05-20 cs.CV

CPC-VAR: Continual Personalized and Compositional Generation in Visual Autoregressive Models

Junhao Li, Xinhao Zhong, Yi Sun, Yuxia Qiao, Bin Chen, Shu-Tao Xia, Yaowei Wang

Affiliations * Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; South China University of Technology

AI summary This paper studies continual personalized generation in visual autoregressive models. It proposes a unified framework that combines gradient-based concept neuron selection with a context-aware composition strategy to address the key challenges of sequential single-concept learning and multi-concept synthesis, improving long-sequence continual personalization and multi-concept image synthesis.

Abstract

Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

2605.19748 2026-05-20 cs.AI cs.MA

Memory-Augmented Reinforcement Learning Agent for CAD Generation

Yin Xiaolong, Liu Yu, Shen Jiahang, Lu Xingyu, Ni Jingzhe, Fan Fengxiao, Sang Fan

Affiliations * Zhejiang University

AI summary This paper proposes a memory-augmented reinforcement learning framework for generating CAD models. By introducing reinforcement learning into retrieval and policy optimization, it avoids retrieval traps and improves the success rate and geometric consistency of complex CAD model generation.

Comments 26 pages; multilingual submission: English version first, followed by Chinese version

Abstract

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex CAD models characterized by long operation sequences, diverse operation types, and strong geometric constraints, primarily because reasoning chains break and effective error-correction mechanisms are lacking. To address this problem, this paper proposes a memory-augmented reinforcement learning framework for CAD generation agents. The framework encapsulates the underlying geometric kernel into a structured toolchain callable by the agent and builds a closed-loop mechanism of design intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, and proposes a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent can effectively avoid retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. Experiments show that the proposed method significantly improves both the success rate and geometric consistency on complex CAD model generation tasks.

2605.19744 2026-05-20 cs.CV

Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection

Albert Schotschneider, Daniel Bogdoll, Svetlana Pavlitska, Ahmed Abouelazm, Johann Marius Zoellner

Affiliations * FZI Research Center for Information Technology; KIT Karlsruhe Institute of Technology

AI summary This paper proposes an adaptable real-time anomaly detection method that uses pretrained vision transformer embeddings to detect potential anomalies via nearest-neighbor similarity in the latent semantic feature space, and evaluates it in real-world scenarios.

Comments Accepted at CVPR 2026 Workshop AUTOPILOT-NA

Abstract

Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.
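
The patch-wise nearest-neighbor idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' pipeline: the patch embeddings are assumed to come from some pretrained vision transformer (not shown), and the function names, the cosine similarity, and the 0.5 threshold are hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_scores(query_patches, reference_patches):
    """Score each query patch as 1 minus its best similarity to any reference patch.

    A patch that closely matches something in the single reference image
    scores near 0; a patch unlike anything in the reference scores higher.
    """
    return [
        1.0 - max(cosine(q, r) for r in reference_patches)
        for q in query_patches
    ]


def anomaly_mask(scores, threshold=0.5):
    """Binarize per-patch scores into a dense mask (threshold is an assumption)."""
    return [s > threshold for s in scores]


# Toy 2-D "embeddings" standing in for ViT patch features.
reference = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0], [0.7, 0.7], [-1.0, -1.0]]
scores = anomaly_scores(query, reference)
mask = anomaly_mask(scores)
```

Reshaping the per-patch mask back onto the image's patch grid would give the dense anomaly mask used for localization.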

2605.19738 2026-05-20 cs.CL cs.AI

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia

Affiliations * College of Computer Science and Technology, Jilin University; School of Computer Science and Information Technology, Adelaide University; School of Computing Technologies, RMIT University; College of Computing and Data Science, Nanyang Technological University

AI summary This paper proposes TERGAD, a novel data augmentation framework that strengthens graph anomaly detection with the semantic reasoning capabilities of large language models. It translates node topological properties into descriptive natural language and fuses the resulting semantic embeddings with the original node attributes through a gated dual-branch autoencoder, detecting anomalous graph entities more effectively.

Comments 14 pages, 5 figures

Abstract

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.

2605.19735 2026-05-20 cs.CL cs.AI

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin

Affiliations * HSE University

AI summary This paper proposes ContextRAG, a graph retrieval-augmented generation system that needs no large language model to extract entities and relations. It builds a fuzzy concept graph with residual-quantization k-means and Formal Concept Analysis, reaching 33.6% F1 on a 130-task UltraDomain subset while building its index with only 30 LLM calls.

Comments Preprint. 6 tables

Abstract

Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.
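
The "over 23M indexing tokens" figure follows from simple proportional scaling of the numbers quoted above. The short check below reproduces that arithmetic; the helper name is hypothetical, and the linearity assumption is the abstract's own, applied to a HiRAG run that failed before completing.

```python
def extrapolate_linear(observed_tokens, observed_tasks, target_tasks):
    """Scale a token count proportionally to the number of tasks."""
    return observed_tokens * target_tasks / observed_tasks


# Figures quoted in the abstract.
hirag_tokens_on_20_tasks = 3.54e6
contextrag_tokens_on_130_tasks = 22_073

projected_hirag_tokens = extrapolate_linear(hirag_tokens_on_20_tasks, 20, 130)
cost_ratio = projected_hirag_tokens / contextrag_tokens_on_130_tasks

print(f"{projected_hirag_tokens / 1e6:.2f}M projected indexing tokens")  # 23.01M
print(f"about {cost_ratio:.0f}x the ContextRAG indexing tokens")
```

Because the HiRAG measurement comes from an incomplete run, the ratio is best read as an order-of-magnitude indication rather than a precise comparison.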

2605.19734 2026-05-20 cs.CV

GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

Tiantong Fang, Xiuwei Wang, Jing Xiao, Wujie Zhou, Liang Liao, Mi Wang

Affiliations * School of Artificial Intelligence, Wuhan University; School of Artificial Intelligence and Information Engineering, Zhejiang University of Science & Technology; Hangzhou Institute of Technology, Xidian University; State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University

AI summary This paper proposes the GeoMamba framework, which adds a geometric feature injection module and a geometric consistency constraint module to make fine-grained optical-SAR object retrieval more robust, and builds a new dataset, FGOS-as, to evaluate cross-modal retrieval performance.

Abstract

Multi-source remote sensing enables complementary observation of ground objects, yet cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in the all-to-all retrieval setting.

2605.19728 2026-05-20 cs.CV

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

Affiliations * Institute of Artificial Intelligence, University of Central Florida

AI summary This paper proposes Aero-World, a method that converts a pretrained image-to-video diffusion model into a controllable aerial video generator. It injects acceleration and angular velocity sequences and uses a frozen Physics Probe for inertial-consistency supervision, improving both how closely generated videos follow low-level action signals and their temporal stability.

Abstract

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

2605.19727 2026-05-20 cs.CV

Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Zebin He, Mingxin Yang, Shuhui Yang, Hanxiao Sun, Xintong Han, Chunchao Guo, Wenhan Luo

Affiliations * HKUST; Tencent Hunyuan

AI summary This paper proposes Tango3D, a 3D foundation model that unifies dense correspondence and global retrieval. A geometry-aware 2D visual backbone and a pretrained 3D VAE encode images into 2D patches and point clouds into 3D tokens, which are mapped into a shared space for local pixel-to-point alignment and global semantic alignment.

Abstract

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

2605.19726 2026-05-20 cs.CV

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

Wenhu Zhang, Yiming Wu, Huanyu Wang, Yaoyang Liu, Huanzhang Dou, Senqiao Yang, Sitong Wu, Hanbin Zhao, Jiaya Jia

Affiliations * The Hong Kong University of Science and Technology; The University of Hong Kong; Zhejiang University; The Chinese University of Hong Kong

AI summary This paper proposes the Block Approximate Sparse Attention framework (BA-Att), whose block-wise pre-downsampling operation identifies informative regions without relying on brittle positional priors, improving computational efficiency while preserving performance. Experiments show up to 6.95x faster attention computation than FlashAttention and near full-attention performance at 50% sparsity.

Comments CVPR 2026 Findings paper

Abstract

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, but scaling them to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with a block-wise pre-downsampling operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.

2605.19723 2026-05-20 cs.CL cs.AI

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

Affiliations * School of Electrical Engineering Computer Science, National University of Science; School of Computing, Data Mathematical Sciences, Western Sydney University, Indonesia; Department of Communication, Quality Management Information Systems, Mid Sweden University, Östersund Campus, Sweden

AI summary This survey reviews recent progress in mathematical reasoning with large language models. It analyzes datasets, architectures, training strategies, and evaluation protocols, and discusses benchmarks, architecture design, evaluation methods, and open research challenges.

Abstract

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

2605.19721 2026-05-20 cs.AI cs.LG cs.NI

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

Franco Terranova, Guillermo Bernardez, Albert Cabellos-Aparicio, Nina Miolane, Abdelkader Lahmadi

Affiliations * Université de Lorraine, CNRS, Inria, LORIA; University of California Santa Barbara; Universitat Politècnica de Catalunya

AI summary This paper proposes a new RL-GCO approach that operates directly in a continuous GNN action embedding space to solve graph combinatorial optimization problems efficiently, improving generalization and scalability.

Comments Preprint

Abstract

Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.
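
The "simple nearest-neighbor decoding" step mentioned above can be illustrated as follows. This is a sketch under assumptions rather than the LaGCO-RL API: the function name, the Euclidean metric, and the validity mask are hypothetical, and the latent action would in practice come from the policy's forward pass.

```python
import math


def decode_action(latent_action, action_embeddings, valid_mask):
    """Map a predicted latent action to the closest valid discrete action.

    latent_action: vector predicted by the policy in the embedding space.
    action_embeddings: one embedding per discrete action (e.g. per graph node).
    valid_mask: booleans marking which discrete actions are currently allowed.
    Returns the index of the nearest valid action under Euclidean distance.
    """
    best_index, best_dist = None, math.inf
    for i, (emb, valid) in enumerate(zip(action_embeddings, valid_mask)):
        if not valid:
            continue  # never decode to an infeasible action
        dist = math.dist(latent_action, emb)
        if dist < best_dist:
            best_index, best_dist = i, dist
    if best_index is None:
        raise ValueError("no valid action available")
    return best_index


# Toy embedding table with three discrete actions.
embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
predicted = [0.9, 0.8]
unconstrained_choice = decode_action(predicted, embeddings, [True, True, True])
masked_choice = decode_action(predicted, embeddings, [True, False, True])
```

The policy makes a single prediction regardless of how many actions exist; only this lookup grows with the action set, and it could be swapped for an approximate nearest-neighbor index on large graphs.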

2605.19718 2026-05-20 cs.CL

CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

Francesca Padovani, Xiulin Yang, Bastian Bunzeck, Jaap Jumelet, Yevgen Matusevych, Nathan Schneider, Arianna Bisazza

Affiliations * Center for Language and Cognition (CLCG), University of Groningen; Georgetown University; Computational Linguistics, Department of Linguistics, Bielefeld University

AI summary This paper presents CAIT, a syntactic parsing toolkit built specifically for CHILDES data. By training a state-of-the-art dependency parser and accompanying taggers, it parses the syntactic patterns of child-adult interactions more accurately and supports large-scale, reproducible research on language acquisition.

Abstract

CHILDES is a paramount resource for language acquisition studies, yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child-adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Parsing Toolkit for Child-Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.

2605.19717 2026-05-20 cs.CV

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

Elias Berger, Muhammad Usama, Jan Mehlstäubl, Bernhard Saske, Kristin Paetzold-Byhain

Affiliations * Dresden University of Technology; MAN Truck & Bus SE; German Research Center for Artificial Intelligence; RPTU Kaiserslautern-Landau

AI summary This paper proposes a hybrid agentic-physical architecture that embeds validated knowledge-based engineering tools directly into the decision-making loop of autonomous AI agents, addressing the lack of physical understanding in LLM-generated CAD designs. Explicit physical verification guides a closed-loop, sequential decision process and improves the physical correctness of the generated designs.

Comments Accepted in IJCAI-ECAI 2026 (Special Track on AI4Tech)

Abstract

Large Language Models (LLMs) can generate Computer-Aided Design (CAD), yet lack the physical comprehension required for reliable engineering design. Instead of attempting to implicitly learn physical laws from data, we propose a Hybrid Agentic-Physical Architecture that embeds validated knowledge-based engineering tools directly into the decision-making loop of autonomous AI agents. In this framework, engineering design is formulated as a closed-loop, sequential decision-making process guided by explicit physical verification. Based on a load case, dedicated agents iteratively plan, generate, evaluate, and revise engineering designs using knowledge-based tools as a feedback signal. We introduce a benchmark dataset and metrics for assessing functional validity in generative CAD. Our system generates more complex and physically verified designs, with a 4.2 increase in structural complexity and a 3.5% higher compile rate compared to similar agentic methods. The codebase, prompts, and dataset will be made publicly available to support reproducibility and future research.

2605.19714 2026-05-20 cs.CL

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

基于大型语言模型的阿拉伯语金融情绪分析:来自沙特市场的证据

Mona H. Albaqawi, Eman M. Albalkhi, Joud A. Albaiti, Enrico Lopedoto

发表机构 * George Mason University(乔治·马歇尔大学) Damascus University(大马士革大学) University of Jeddah(朱德赫大学) City, St George's, University of London(伦敦城市大学)

AI总结 本文提出了一种针对沙特市场的阿拉伯语NLP框架,用于大规模金融情绪分析,结合官方财务新闻和社会媒体数据,通过多阶段流程构建阿拉伯语财务语料库,并利用Transformer-based NER和定制公司词典进行情绪标注,最终实现了对公司层面的情绪聚合和情绪动态分析。

Comments Accepted at the 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7), co-located with LREC 2026, Palma de Mallorca, Spain, May 2026. ISBN: 978-2-493814-52-4

Details
Abstract

Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

2605.19712 2026-05-20 cs.CV

Physics-informed simulation framework for realistic sonar image generation and statistical validation

Kamal Basha S, Athira Nambiar

Affiliations * Department of Computational Intelligence, SRM Institute of Science

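AI Summary This paper presents ACOUSIM, a physics-informed simulation framework for generating realistic sonar images and validating them statistically. By comparing the statistical properties of synthetic and real sonar imagery, it establishes a reproducible, distribution-level baseline.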

Details
Abstract

Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover's Distance. Results show strong texture alignment (KL < 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.
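The three distribution measures named in the abstract can be sketched on plain histograms. This is a minimal illustration, not the authors' implementation: it assumes the intensity or LBP histograms have already been computed with identical binning, and the function names (`normalize`, `kl_divergence`, `js_divergence`, `emd_1d`) and the sample bin counts are hypothetical.

```python
import math


def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against empty bins in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def emd_1d(p, q):
    """1-D Earth Mover's Distance with unit bin spacing.

    For 1-D histograms this equals the summed absolute difference
    between the two cumulative distributions.
    """
    cdf_diff, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_diff += pi - qi
        total += abs(cdf_diff)
    return total


# Made-up bin counts for illustration only (not data from the paper).
synthetic = normalize([10, 30, 40, 20])
real = normalize([12, 28, 38, 22])
print(kl_divergence(synthetic, real), js_divergence(synthetic, real), emd_1d(synthetic, real))
```

KL divergence is asymmetric, so the direction of comparison (synthetic against real, or the reverse) has to be fixed before a threshold such as the reported KL < 0.07 can be interpreted; the abstract does not state which direction was used.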

2605.19711 2026-05-20 cs.CL

Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling

Affiliations * University of Groningen

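AI Summary This study examines whether large language models can improve automatic speech recognition (ASR) for a low-resource language, West Frisian, through generative error correction (GER). GER improves ASR performance in most settings, and a detailed error analysis reveals the models' correction patterns.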

Comments Submitted to Interspeech 2026

Details
Abstract

Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
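For readers unfamiliar with the metric, word error rate (WER) and an oracle WER can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: it assumes whitespace tokenization and the common "N-best oracle" definition (the lowest WER among the ASR system's candidate hypotheses), which the abstract does not spell out. The function names and example strings are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance and reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return errors / n_ref


def oracle_wer(reference, nbest):
    """Lowest WER achievable by picking the best hypothesis from an N-best list."""
    return min(wer(reference, hyp) for hyp in nbest)


# Placeholder strings for illustration only.
reference = "a b c d"
nbest = ["a x c", "a b c e"]
print(wer(reference, nbest[0]), oracle_wer(reference, nbest))
```

Under this definition, a corrected transcript that scores below the oracle WER contains words that appear in none of the N-best candidates, which is why surpassing the oracle is a notable result for a generative corrector.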