arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2042
2606.06047 2026-06-05 cs.CL

Automatic Labelling of Speech Translation Errors

语音翻译错误的自动标注

Dominik Macháček, Maike Züfle, Ondrej Klejch

发表机构 * Charles University(查尔斯大学) University of Edinburgh(爱丁堡大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 针对语音翻译缺乏置信度评估方法的问题,提出STEL标注协议,通过文本和多模态系统分析,发现直接语音处理对任务必要且与文本系统互补。

详情
AI中文摘要

语音翻译中的错误会降低语音翻译(ST)系统的可信度,并可能产生严重后果。然而,目前尚无评估语音翻译置信度和质量估计的成熟方法。为启动这一方向的进展,我们提出了语音翻译错误标注(STEL)。我们创建了一个标注协议、一个小型真实的端到端评估数据集,并分析了现有纯文本和语音处理系统如何执行STEL任务。我们的结果表明,纯文本XCOMET和多模态LLM Qwen2.5-Omni能够以大约人类一半的精度执行STEL任务。我们还发现,直接语音处理对于STEL任务是必要的,并且当前的纯文本和语音处理系统在标注ST中的翻译错误与语音处理错误方面具有互补性。

英文摘要

Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol, a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and multimodal LLM Qwen2.5-Omni are able to perform the STEL task in roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.

2606.06044 2026-06-05 cs.CL

IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval

IA-RAG:基于区间代数的动态知识检索时间推理

Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang, Guohang Yan, Song Mao, Botian Shi, Yunshi Lan, Pinlong Cai

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Shanghai for Science and Technology(上海科技大学) Harbin Engineering University(哈尔滨工程大学)

AI总结 提出IA-RAG框架,通过区间代数建模时间约束,实现层次化时间检索与推理,在复杂时间问答任务上表现优异。

Comments 22 pages, 10 figures, 13 tables. Code available at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA

详情
AI中文摘要

检索增强生成(RAG)在利用外部知识增强大语言模型(LLMs)方面表现出强大的有效性。然而,现有的RAG和Graph RAG框架大多将知识视为静态,或仅将时间与粗粒度的时间戳或元数据关联,未能捕捉丰富的时间结构,如持续时间、重叠和包含关系。我们提出IA-RAG,一种层次化时间RAG框架,将知识建模为时间区间,并在形式化时间约束下进行检索。IA-RAG将事实表示为区间事件单元(IEUs),并将其组织成层次化的主题森林,其中时间依赖关系由Allen的区间代数控制。为处理不完整或不确定的时间边界,IA-RAG进一步引入子图时间收紧机制,通过连接事件子图中的逻辑约束来细化模糊区间。此外,IA-RAG通过区间代数引导的遍历支持隐式时间语义检索。在多个时间问答基准(包括TimeQA、TempReason和ComplexTR)上的实验表明,IA-RAG在时间检索和推理性能上表现优异,尤其是在复杂的组合时间推理任务上。我们的代码已发布在https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA。

英文摘要

Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.

2606.06041 2026-06-05 cs.RO cs.AI cs.NE

Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning

通过零样本迁移学习实现机器人操作任务的样本高效低级运动规划

Yuanzhi He, Victor Romero-Cano, José J. Patiño, Juan David Hernández, William Sawtell, Gualtiero Colombo

发表机构 * School of Computer Science & Informatics, Cardiff University, Cardiff, UK(计算机科学与信息学系,卡迪夫大学,卡迪夫,英国)

AI总结 提出iCEM+TL框架,通过迁移学习和奖励重塑提高复杂操作任务的成功率,仿真中提升高达23%,并在真实机器人上验证。

Comments 12 pages, 5 figures, International Conference on Artificial Neural Networks (ICANN) 2026 conference accepted

详情
AI中文摘要

随着机器人系统变得日益复杂,其运动规划模型的复杂性和更长的训练时间带来了巨大挑战。进化算法如样本高效交叉熵方法(iCEM)最近通过利用高效的知识重用策略来提升性能,在低级实时规划中展现出潜力。尽管在许多控制任务中有效,但iCEM在更复杂场景中的性能可能受到限制,特别是那些需要堆叠、滑动和放置到架子的任务。在这项工作中,我们提出了一种新颖的iCEM+TL框架,明确利用迁移学习(TL),其中关键的iCEM参数从较简单的上游任务迁移以指导更复杂的下游任务。此外,我们通过任务分解对堆叠物体和放置到架子应用了奖励重塑(RR)以优化任务特定性能。仿真结果表明,我们的框架实现了高达23%的成功率提升。该框架还在真实的Franka Emika机器人上的堆叠任务中得到进一步验证,展示了其在实际部署中的可行性。

英文摘要

As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.

2606.06040 2026-06-05 cs.RO cs.SY eess.SY

Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots

快速生长:高速藤蔓机器人尖端支架的设计与基准测试

Antonio Alvarez Valdivia, Robert Reeve, Ankush Dhawan, Ciera McFarland, Chad Council, Margaret McGuinness, Nathaniel Hanson

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Lincoln Laboratory(林肯实验室) Stanford University(斯坦福大学) University of Notre Dame(圣母大学)

AI总结 提出一种三角滚轮尖端支架,通过滚动代替滑动减少生长阻力,实现TPU涂层防撕裂尼龙藤蔓机器人的一致外翻,并建立可重复的基准测试框架。

Comments Accepted to IEEE Robotics & Automation Letters

详情
AI中文摘要

软体生长藤蔓机器人通过尖端外翻机制扩展,该机制使其能够在杂乱环境中导航。然而,在尖端集成摄像头和其他传感器具有独特挑战,因为形成尖端的材料随着机器人生长而不断更新。这种持续的材料更替,加上内层之间的摩擦、增加的尖端重量和织物收缩,使传感器和工具安装复杂化。这些限制阻碍了藤蔓机器人在检查和搜索任务中的应用,而快速生长并携带尖端传感器至关重要。在这项工作中,我们提出了一种三角滚轮尖端支架,通过滚动而非滑动与机器人本体接触,减少生长过程中的内部阻力。通过迭代故障分析优化设计,首次实现了在TPU涂层防撕裂尼龙藤蔓机器人上的一致外翻。为了定量评估支架性能,我们引入了一个定制测试台,通过测量外翻过程中的尾部张力来隔离尖端安装效应。跨多个支架变体(包括先前设计)的比较实验表明,我们的三角滚轮支架实现了最低的尾部张力和最可重复的生长性能。这些结果既建立了一个经过验证的尖端支架设计,也为推进软体生长机器人中传感器和工具集成提供了一个可重复的基准测试框架。支架和测试台的CAD文件可在以下网址获取:https://sprout-mitll.github.io/tip_mounts/。

英文摘要

Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.

2606.06039 2026-06-05 cs.CV

Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction

保留纹理的隐式神经表示用于锥束CT截断重建

Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu, Fenglin Liu

发表机构 * National Key Research and Development Program of China(中华人民共和国国家重点研发计划) National Natural Science Foundation of China(中华人民共和国国家自然科学基金) Fundamental Research Funds for the Central Universities(中央高校基本科研业务费)

AI总结 提出一种自监督的3D重建框架,基于神经场景表示,结合物理迭代细化模块,解决锥束CT截断重建中的伪影和纹理丢失问题。

详情
AI中文摘要

锥束计算机断层扫描(CBCT)经常受到数据截断的影响,这引入了严重的伪影并限制了有效视场(FOV)。现有的用于截断锥束CT重建的深度学习方法存在严重局限性,包括严格依赖有监督的真实数据和未能考虑连续3D空间截断变化。为了解决这些挑战,我们引入了一个基于神经场景表示的自监督3D重建框架。通过在投影监督下将空间坐标直接映射到辐射密度,我们的方法固有地绕过了传统的滤波和反投影操作,从而从根本上消除了截断引起的环状伪影,同时实现了鲁棒的连续3D数据外推。然而,坐标网络容易受到固有的频谱偏差影响,这导致临床关键的高频纹理严重丢失。为了解决这一瓶颈,我们进一步将基于物理的迭代细化模块集成到神经场景表示架构中。利用来自坐标网络的无伪影外推体积作为最优初始化,该模块逐步从原始投影中重新提取高频结构信息并将其注入体积中。在模拟和真实数据集上的大量实验表明,我们的方法成功地将神经网络的优异伪影抑制和外推能力与迭代算法的高保真细节保留统一起来。

英文摘要

Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated cone-beam computed tomography (CBCT) reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.

2606.06038 2026-06-05 cs.CL

English-to-Prakrit Machine Translation via Multilingual Transfer Learning

英语到普拉克里特语的机器翻译:基于多语言迁移学习

Om Choksi, Smit Kareliya, Shrikant Malviya, Pruthwik Mishra

发表机构 * Sardar Vallabhbhai National Institute of Technology(萨达尔·瓦拉布尔·尼西特国家理工学院)

AI总结 针对低资源目标语言普拉克里特语,通过将普拉克里特语映射到印地语标签并利用多语言模型,在少量平行语料上实现可行的机器翻译,揭示了脚本兼容的语言路由对未支持古典语言的迁移潜力及数据稀缺和方言不匹配的限制。

详情
AI中文摘要

我们在低资源环境下研究英语到普拉克里特语的机器翻译,其中目标语言不受IndicTrans2支持。我们通过将普拉克里特语映射到印地语标签(hin_Deva)来适配多语言模型,而不修改分词器、词汇表或架构。使用包含1,474对马哈拉施特拉普拉克里特语平行语料库,并在20样本的阿尔达摩揭陀语测试集上进行评估,我们报告了相对于未调优基线的语料库BLEU改进。结果表明,脚本兼容的语言路由可以实现对未支持的古典语言的可行迁移,同时突显了数据稀缺和方言不匹配带来的局限性。我们的代码和训练模型已公开发布,供进一步探索:https://github.com/D3v1s0m/indictrans2-prakrit-mt。

英文摘要

We study English-to-Prakrit machine translation in a low-resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474-pair Maharashtri Prakrit parallel corpus and evaluation on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are released to the public for further exploration https://github.com/D3v1s0m/indictrans2-prakrit-mt.

2606.06036 2026-06-05 cs.AI cs.IR

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

记忆是重建的,而非检索的:面向LLM智能体的图记忆

Shuo Ji, Yibo Li, Bryan Hooi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出MRAgent框架,通过关联记忆图和主动重建机制,使LLM智能体在推理过程中动态调整记忆访问,显著提升长程记忆推理性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

尽管近期取得了进展,LLM智能体在处理长交互历史推理时仍面临困难。当前记忆增强智能体依赖静态的检索-推理范式,这种僵化的流水线设计阻碍了它们根据推理过程中发现的中间证据动态调整记忆访问。为弥补这一差距,我们提出MRAgent,一个将关联记忆图与主动重建机制相结合的框架。我们将记忆表示为线索-标签-内容图,其中关联标签作为语义桥梁连接细粒度线索与记忆内容。在此结构上,我们的主动重建机制将LLM推理直接融入记忆访问,使智能体能够基于累积证据迭代探索和修剪检索路径。这确保了记忆检索动态适应推理上下文,同时避免无约束扩展导致的组合爆炸。在LoCoMo基准和LongMemEval基准上的实验表明,该方法在强基线上取得了显著提升(高达23%),同时大幅降低了令牌和运行时间成本,凸显了主动和关联重建对于长程记忆推理的有效性。

英文摘要

Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.

2606.06034 2026-06-05 cs.LG cs.AI

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

当足够好即最优:量化门控DeltaNet的仅乘法矩阵求逆近似

Luoming Zhang, Yuwei Ren, Kui Zhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对分块并行线性注意力中矩阵求逆的瓶颈,提出基于截断Neumann级数展开的仅矩阵乘法算法,结合结构掩码和并行残差校正,实现NPU上5倍内核加速和20%解码层开销降低。

详情
AI中文摘要

分块并行线性注意力中的矩阵求逆是长上下文建模的主要瓶颈,尤其是在NPU上,基于前向替换的方法并行性有限且硬件利用率低。我们提出了一种快速的、基于矩阵乘法(MatMul)的算法,专门针对分块线性注意力中出现的严格下三角矩阵。受Neumann级数项快速增长和逆矩阵对角集中性的启发,我们采用截断Neumann展开,结合结构掩码和并行残差校正,以消除顺序依赖。我们进一步将方法扩展到低比特INT,通过缓解重复矩阵幂运算引起的动态范围扩展,并根据块大小调整近似阶数和残差步长,以最小化计算成本同时保持模型精度。在Qwen3.5系列模型上的实验表明,在浮点和低精度推理下,该方法实现了高达5倍的内核级加速和20%的解码层开销降低,同时保持了精度。我们的方法为可扩展线性注意力提供了一种高效且硬件友好的解决方案。

英文摘要

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bits INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5$\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.

2606.06032 2026-06-05 cs.LG

Catastrophic Forgetting as Accessibility Collapse: A Three-Level Framework for Knowledge Persistence in Continual Learning

灾难性遗忘作为可访问性崩溃:持续学习中知识持久性的三层次框架

Ayushman Trivedi, Bhavika Melwani

发表机构 * Independent Researchers(独立研究者)

AI总结 本文提出一个三层次框架(知识存储、表示和可访问性),通过实验证明灾难性遗忘主要是可访问性失败而非表示擦除,重新训练分类器可恢复大部分性能。

Comments 14 pages, 6 figures, 8 tables. Sequential continual-learning experiments on CIFAR-100 using ResNet-18

详情
AI中文摘要

灾难性遗忘通常被解释为在顺序学习过程中先前获得的知识不可逆转地擦除。在这项工作中,我们研究了一种替代观点:遗忘可能并非源于任务表示的完全破坏,而是源于对保存信息的可访问性丧失。我们引入了一个三层次框架,将知识存储、表示和可访问性分开,并通过在ResNet-18上对顺序CIFAR-100分类进行的一系列持续学习实验来评估每个组件。我们的分析结合了检查点持久性、线性探测、表示几何、分类器重置恢复和逐层可恢复性实验。我们观察到早期任务的完全行为遗忘,任务准确率从54.8%下降到0%,而线性探测性能保留了大约76%的原始表示信息。此外,仅重新训练最终分类器就恢复了原始任务性能的75.7%,而无需修改骨干网络。逐层分析表明,早期和中间层保留了高度可恢复的任务信息,尽管后期阶段严重退化。投影能量和主角度分析表明,保留的知识以分布式高维表示的形式持续存在,而不是通过保留一个小的主导子空间。这些发现表明,灾难性遗忘更适合被描述为可访问性失败而非完全表示擦除,并且即使在功能遗忘发生后,大量任务相关信息仍嵌入在神经表示中。

英文摘要

Catastrophic forgetting is commonly interpreted as the irreversible erasure of previously acquired knowledge during sequential learning. In this work, we investigate an alternative perspective: that forgetting may arise not from complete destruction of task representations but from a loss of accessibility to preserved information. We introduce a three-level framework separating knowledge storage, representation, and accessibility, and evaluate each component through a series of continual-learning experiments on sequential CIFAR-100 classification using ResNet-18. Our analysis combines checkpoint persistence, linear probing, representation geometry, classifier-reset recovery, and layer-wise recoverability experiments. We observe complete behavioral forgetting of earlier tasks, with task accuracy collapsing from 54.8% to 0%, while linear probe performance retains approximately 76% of the original representational information. Furthermore, retraining only the final classifier restores 75.7% of the original task performance without modifying the backbone network. Layer-wise analysis reveals that early and intermediate layers preserve highly recoverable task information despite severe degradation at later stages. Projection-energy and principal-angle analyses indicate that retained knowledge persists as distributed high-dimensional representations rather than through preservation of a small dominant subspace. These findings suggest that catastrophic forgetting is better characterized as an accessibility failure than complete representational erasure, and that substantial task-relevant information remains embedded within neural representations even after functional forgetting has occurred.

2606.06031 2026-06-05 cs.CL

NAVIRA: Decoupled Stochastic Remasking for Masked Diffusion Language Models

NAVIRA: 解耦随机重掩码用于掩码扩散语言模型

Andrey Fomenko, Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko

发表机构 * Lomonosov Moscow State University(莫斯科罗蒙诺索夫莫斯科国立大学) Institute for Artificial Intelligence(人工智能研究所)

AI总结 针对掩码扩散语言模型并行生成中的上下文污染问题,提出NAVIRA解码策略,通过解耦质量检测与重生成、随机采样重掩码位置,提升流畅性和多样性。

详情
AI中文摘要

掩码扩散语言模型通过并行迭代地解除掩码生成文本,但这种速度带来了修正问题:同一时间步生成的标记是从边缘分布预测的,早期局部依赖错误可能污染后续上下文。PRISM通过学习标记级质量分数并重掩码不可靠标记来解决此问题,但其推理规则是耦合的:同一前向传播既检测低质量标记又计算其替换的对数几率,因此错误标记仍会条件化再生。我们提出NAVIRA,一种推理时解码策略,将这两个操作分离并随机采样重掩码位置。第一次前向传播对标记评分;选中的标记被掩码;第二次前向传播从清理后的上下文再生。温度控制的重掩码减少对相同位置的重复修正,并在流畅性与多样性之间取得平衡。在170M掩码扩散语言模型的受控实验中,解耦提高了流畅性,而调度的随机重掩码保持了熵,并在更大的前向传播预算下获得了更强的LLM评判分数。这些结果表明,重掩码策略(而不仅仅是学习到的质量信号)对于可靠的掩码扩散文本生成至关重要。

英文摘要

Masked diffusion language models generate text by iteratively unmasking many tokens in parallel, but this speed comes with a correction problem: tokens generated in the same step are predicted from marginal distributions, and early local dependency errors can later contaminate the context. PRISM addresses this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled: the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. We propose NAVIRA, an inference-time decoding policy that separates these two operations and samples remasking positions stochastically. A first forward pass scores tokens; selected tokens are masked; a second forward pass regenerates from the cleaned context. Temperature-controlled remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. These results show that remasking policy, not only the learned quality signal, is central to reliable masked-diffusion text generation.

2606.06027 2026-06-05 cs.AI cs.CL cs.LG cs.SI

RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

RedditPersona: 一个用于从Reddit进行社区条件化LLM适配的模块化框架

Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

发表机构 * Future Computing Group University of Oulu(未来计算组奥卢大学) Centre for Applied Computing University of Oulu(应用计算中心奥卢大学)

AI总结 提出RedditPersona模块化框架,通过五种分组策略和QLoRA训练参数高效适配器,在112个Reddit子版块上评估社区条件化语言模型,发现适配器的行为可识别性与策略内在一致性相关,且所有策略在可识别性和分布相似性之间存在一致权衡。

详情
AI中文摘要

社区条件化的语言模型适配需要在每个研究中独立做出关于数据收集、社区定义和评估的选择,这使得比较假设或重用工件变得困难。我们提出了RedditPersona,一个模块化框架,标准化了这些选择:它收集Reddit帖子和评论,分析活跃用户,根据五种分组策略(基于子版块、图结构、语义、混合和基于交互)对用户进行划分,通过QLoRA为每种策略训练参数高效的适配器,并在一个涵盖流畅性、忠实度、分布对齐和社区可识别性的共享度量套件下进行评估。应用于城市福祉领域的112个子版块(301,429个用户档案,超过1600万条评论),我们发现适配器的行为可识别性追踪了每种策略与子版块基线的内在一致性,并且所有五种策略在可识别性和与真实文本的分布相似性之间存在一致的权衡。代码和配置文件可在以下网址获取:https://github.com/Ahghaffari/redditpersona。

英文摘要

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

2606.06025 2026-06-05 cs.CL cs.AI

EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

EGTR-Review: 基于多智能体教师蒸馏的高效证据支撑科学同行评审生成

Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

发表机构 * Department of Information Management, Peking University(北京大学信息管理系) PKU-WUHAN Institute for Artificial Intelligence, Peking University(北京大学武汉人工智能研究院)

AI总结 提出EGTR-Review框架,通过多智能体教师蒸馏和证据加权目标,实现轻量级学生模型的高质量、可溯源同行评审生成。

详情
AI中文摘要

科学同行评审生成因能减少评审负担并提供及时反馈而受到越来越多的关注。然而,现有基于大型语言模型(LLM)的方法往往产生缺乏证据支持和弱源可追溯性的通用评论,而复杂的多智能体系统则导致高推理成本。为应对这些挑战,我们提出EGTR-Review,一种通过多智能体教师蒸馏实现的证据支撑且可追溯的评审生成框架。EGTR-Review首先构建一个多智能体教师,执行结构感知的论文分解、关键元素提取、外部学术证据检索、证据状态标注、验证推理和评审合成。然后,通过任务前缀驱动的多任务学习,将中间推理轨迹和最终评审评论蒸馏到轻量级学生模型中。证据加权目标进一步减少弱、缺失或不可验证监督的影响。在公共同行评审数据集上的实验表明,EGTR-Review(学生)在自动指标、LLM作为评判者评估和人工评估中均优于强提示基、微调基和结构化/智能体基线,同时保持强事实基础和源可追溯性,且显著降低令牌消耗和推理时间。我们的代码、提示、配置和样本数据可在GitHub上获取。

英文摘要

Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

2606.06022 2026-06-05 cs.CL

Contextualized Prompting For Stance Detection On Social Media

社交媒体立场检测的上下文化提示

Tilman Beck, Shakib Yazdani, Simon Kruschinski, Marcus Maurer, Iryna Gurevych

发表机构 * Institute of Intensive Care, University Hospital of Zurich and University of Zurich(重症医学研究所,苏黎世大学医院和苏黎世大学) Institute of Computer Science, University of Goettingen(计算机科学研究所,哥廷根大学) GESIS Leibniz Institute for the Social Sciences(社会科学研究莱比锡研究所) Institut für Publizistik, Johannes Gutenberg-University Mainz(主笔研究所,美因茨约翰· Gutenberg 大学) Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt(无处不在知识处理实验室,达姆施塔特技术大学)

AI总结 通过系统实验,研究在零样本提示中融入真实世界、推导和LLM生成的上下文特征对Twitter立场检测的影响,发现LLM生成的目标描述能持续提升准确率,而其他用户元数据效果不一。

详情
AI中文摘要

社交媒体上的立场检测由于语言简短、嘈杂且依赖上下文而具有挑战性。虽然大型语言模型(LLM)表现出零样本泛化能力,但它们通常在没有上下文信息的情况下被提示,这限制了它们解释模糊帖子的能力。在这项工作中,我们系统地研究了将真实世界(例如,用户传记)、推导(例如,政党)和LLM生成的(例如,目标描述)上下文特征融入Twitter立场检测的零样本提示中的影响。我们的评估涵盖四个基准数据集,包括一个新的高质量德语Twitter立场数据集。在多个LLM中,我们发现整合上下文信息能提高性能,但仅在特定条件下。LLM生成的目标描述持续提升准确性,而其他用户元数据则产生混合甚至有害的效果。值得注意的是,我们表明包含同一用户的其他推文(在监督学习中通常有益)可能会因输入噪声而损害性能。我们的定性分析揭示,LLM难以区分任务特定的有用信息和无关上下文。我们的发现强调了在嘈杂的真实世界环境中使用上下文信息进行提示的前景和挑战。我们在\href{https://github.com/tilmanbeck/stance-context-twitter}{此页面}发布代码和数据。

英文摘要

Stance detection on social media is challenging due to short, noisy, and context-dependent language. While large language models (LLMs) show zero-shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task-specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real-world settings. We publish code and data at this \href{https://github.com/tilmanbeck/stance-context-twitter}{page}.

2606.06020 2026-06-05 cs.CV

ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition

ReSAGE-PAR:行人属性识别中生成式扩展的表征相似性评估

Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral

发表机构 * Universidad Autónoma de Madrid(阿隆托纳大学马德里分校)

AI总结 针对行人属性识别数据稀缺问题,提出ReSAGE-PAR管道,通过扩散模型生成图像并利用贝叶斯分类器验证属性,实现可扩展的高保真数据集扩展,在标准骨干网络上提升高达8.7%。

Comments Under review at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

详情
AI中文摘要

为了解决行人属性识别(PAR)中有限的数据多样性和数据稀缺问题,我们探索了使用基于属性提示的扩散模型进行图像合成。虽然这能够实现行人图像的可控生成,但它面临两个关键挑战:(i)高质量预训练数据与低分辨率、非标准监控裁剪之间的领域差距,以及(ii)需要可靠的属性验证以防止生成幻觉。在本文中,我们引入了一个稳健的生成-评分-自动标注管道,称为ReSAGE-PAR(PAR中生成式扩展的表征相似性评估),它弥合了这一领域差距,并实现了可扩展、高保真的数据集扩展。首先,我们使用定制的基于LoRA的图像到图像方法,将预训练的扩散模型适应到原生PAR分辨率。其次,我们提取生成图像与其条件提示之间的视觉-语言对齐分数,利用包括标签一致和不一致补充的综合提示策略。最后,我们制定了一个贝叶斯分类器,将这些连续分数转换为可靠的二值伪标签。大量评估证明了ReSAGE-PAR在保留空间先验和验证属性方面的有效性。当集成到PAR训练中时,ReSAGE-PAR一致地带来了显著的改进——在标准骨干网络上实现了高达8.7%的提升,并将最先进的框架推向了新的性能水平。这证明了其作为可扩展PAR增强的架构无关解决方案的价值。ReSAGE-PAR的完整代码库可在http://www-vpu.eps.uam.es/publications/ReSAGE-PAR公开获取。

英文摘要

To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements-achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at http://www-vpu.eps.uam.es/publications/ReSAGE-PAR.

2606.06014 2026-06-05 cs.AI cs.RO

PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

PLAN-S:通过潜在风格动态桥接规划以实现自动驾驶世界模型

Xiaoyun Qiu, Jingtao He, Yijie Chen, Yusong Huang, Haotian Wang, Yixuan Wang, Xinhu Zheng

发表机构 * Intelligent Transportation Thrust, Systems Hub, and Center of Seamless Connectivity & Connected Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(智能交通 thrust、系统中心及无缝连接与智能连接研究院,香港科学与技术大学(广州))

AI总结 提出PLAN-S框架,通过从潜在表示解码风格条件语义成本图,解决自动驾驶中潜在世界模型规划的可控性问题,在nuScenes和NAVSIM上降低了碰撞率并提升了驾驶性能。

详情
AI中文摘要

潜在世界模型通过预测紧凑的场景动态来增强端到端自动驾驶,用于下游规划。然而,现有的基于潜在世界模型的规划器通常直接从纠缠的潜在表示生成轨迹。这种紧凑的潜在到规划器路径缺乏对风险、可驾驶性和多样风格偏好的显式建模,使得驾驶风格动态在最终轨迹选择之前难以监督、检查或调制。我们提出PLAN-S(具有潜在风格动态的规划),一个面向规划器的桥接方法,通过从潜在表示解码风格条件的四通道语义成本图来解决这种紧凑-可控性困境。成本图以自我状态和驾驶风格为条件,并通过两个宿主侧接口在规划决策上游被消费:用于回归规划器的注意力级融合和用于锚点得分规划器的奖励级融合。我们在两个架构不同的宿主上验证PLAN-S:nuScenes上的ResWorld和NAVSIM上的WoTE,同时冻结宿主骨干以隔离所提出的桥接的贡献。在nuScenes上,PLAN-S在每个时间范围上降低了基线L2,平均L2为0.55米,3秒碰撞率相对降低42%。在NAVSIM上,规则成本变体达到89.4的预测驾驶模型分数,而学习成本变体在基线挑战场景中提供了互补增益。消融实验表明,成本路径对更安全的轨迹选择贡献最直接。定性结果进一步显示,PLAN-S可以产生多样化的成本图,其空间一致的变化与不同的驾驶风格对齐。

英文摘要

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.

2606.06011 2026-06-05 cs.RO cs.LG cs.MA

Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies

将基于模型的控制与多智能体强化学习相结合以实现多智能体协作团队策略

Christian Llanes, Spencer W. Jensen, Samuel Coogan

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Sandia National Laboratories(桑地亚国家实验室)

AI总结 提出一种结合多智能体强化学习与模型预测控制的框架(MA-AC-MPC),通过扩展演员-评论家模型预测控制实现安全、动态可行的协作策略,并在追逃场景和异构环境中验证其优于多层感知机模型。

Comments 12 pages, 8 figures, 7 tables

详情
AI中文摘要

在这项工作中,我们提出了一种将多智能体强化学习(MARL)与基于模型的控制相结合的框架,以在协作多智能体任务中实现安全、动态可行的动作。多智能体强化学习具有从长期规划视野中的离散不可微奖励中学习多智能体团队协作策略的优势。模型预测控制具有鲁棒性,并在快速重规划框架中为短视野提供安全、动态可行的动作。我们提出了一种将演员-评论家模型预测控制扩展到MARL的算法,称为多智能体演员-评论家模型预测控制(MA-AC-MPC)。我们通过将其应用于多智能体追逃场景来展示该算法的能力。具体来说,我们比较了使用MA-AC-MPC模型和多层感知机模型(MA-AC-MLP)的逃避者团队策略。追逐者团队使用增强比例导航,因为它被接受为一种先进的对抗控制律。我们还提供了一个异构环境的示例,其中无人机和全向轮式机器人协作,在硬件上实现了可重复且成功的着陆,MA-AC-MPC的成功率为100%,而MA-AC-MLP为60%。我们在硬件上证明了所提出的MA-AC-MPC算法在两种环境中的鲁棒性。

英文摘要

In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.

2606.06004 2026-06-05 cs.CL

The Generator-Eraser Paradox: Community Guidelines for Responsible LLM-Assisted Dialect Resource Creation

生成器-擦除器悖论:负责任的大语言模型辅助方言资源创建的社区指南

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 本文提出生成器-擦除器悖论理论框架,推导出12条社区指南,并通过阿拉伯方言案例展示如何在大语言模型辅助方言资源创建中平衡效率与语言多样性保护。

详情
Journal ref
Proceedings of the Workshop on Dialects in NLP - A Resource Perspective (DialRes) @ LREC 2026
AI中文摘要

方言资源在科学描述、文化保护和计算基础设施的交汇处占据独特位置。大语言模型通过检索辅助起草、语料库导航、元数据丰富和标注工作流支持,为加速方言资源开发提供了强大能力。然而,同一系统也带来重大风险:它们可能通过偏爱声望变体、统一正字法以及产生随时间减少语言多样性的合成反馈循环,导致方言擦除。这些风险对于具有双言现象、有限书面标准化或边缘化说话者社区的语言变体尤为严重。本文做出三项贡献。首先,我们整合变异社会语言学和语料库语言学的见解,将生成器-擦除器悖论形式化为一个理论框架,以理解大语言模型辅助方言工作的双重性质。其次,我们推导出12条社区指南,将该框架转化为方言资源创建和记录的可实施设计要求。第三,我们提供阿拉伯方言的深入案例研究,包括对广泛使用资源的结构化比较,以展示这些指南如何解决语言特定挑战,包括双言现象、正字法变异和社区治理。贡献是概念性和操作性的,而非实验性的,目标是使跨语言的方言社区和资源构建者能够采用大语言模型,而不牺牲真实性、变体或主权。

英文摘要

Dialect resources occupy a unique position at the intersection of scientific description, cultural preservation, and computational infrastructure. Large language models offer powerful capabilities for accelerating dialect resource development through retrieval-grounded drafting, corpus navigation, metadata enrichment, and annotation workflow support. However, the same systems pose substantial risks: they can contribute to dialect erasure by privileging prestige varieties, homogenizing orthography, and enabling synthetic feedback loops that reduce linguistic diversity over time. These risks are particularly acute for language varieties characterized by diglossia, limited written standardization, or marginalized speaker communities. This paper makes three contributions. First, we integrate insights from variationist sociolinguistics and corpus linguistics to formalize the generator-eraser paradox as a theoretical framework for understanding the dual nature of LLM-assisted dialect work. Second, we derive 12 community guidelines that operationalize this framework into implementable design requirements for dialect resource creation and documentation. Third, we provide an in-depth case study of Arabic dialects, including a structured comparison of widely used resources, to demonstrate how these guidelines address language-specific challenges including diglossia, orthographic variability, and community governance. The contribution is conceptual and operational rather than experimental, with the goal of enabling dialect communities and resource builders across languages to adopt LLMs without sacrificing authenticity, variation, or sovereignty.

2606.06003 2026-06-05 cs.AI

Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs

超越向量相似性:面向工业知识图谱的图增强检索结构分析

Grama Chethan

发表机构 * Grama Chethan

AI总结 本文通过对比八种检索架构,提出操作符词汇表论点,证明基于LLM的图推理瓶颈在于计算操作符而非模型智能,并引入LLM查询规划器,在工业知识图谱上实现优于定制处理器的性能。

Comments 11 pages

详情
AI中文摘要

检索增强生成(RAG)在需要对互连实体进行结构推理的查询上系统性失败。我们比较了八种用于航空航天供应链情报的检索架构,从文本检索逐步过渡到图遍历和图计算。使用一个包含46个节点和64条类型边的知识图谱,我们评估了10个意图类别下的23个查询,并证明向量检索在结构上无法覆盖五类查询。我们的核心发现是操作符词汇表论点:基于LLM的图推理的障碍不是模型智能,而是作为工具可用的计算操作符。一个配备9种类型遍历原语的LLM查询规划器在性能上优于定制处理器(F1=0.632 vs 0.472),同时能泛化到未见查询。添加6种图计算工具后,LLM仅在遍历失败的查询类别上选择性采用它们。我们还发现一个测量差距:实体级F1系统性低估了正确答案为完整集合的结构查询。

英文摘要

Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. Adding 6 graph computation tools, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically underscores structural queries where comprehensive answers are correct.

2606.05999 2026-06-05 cs.CV cs.AI

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

ATT-CR: 自适应三角变换器用于云去除

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

发表机构 * Xi’an Jiaotong University(西安交通大学) School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics(计算机与人工智能学院,西南财经大学) Ningbo University of Technology(宁波工程学院)

AI总结 提出自适应三角变换器(ATT-CR),通过三角注意力和特征选择门控模块降低计算复杂度并减少云像素干扰,实现高效云去除。

详情
AI中文摘要

云去除旨在准确重建遥感图像中被云遮挡的地面物体。现有的基于Transformer的方法利用自注意力有效建模云图像中的长距离依赖,取得了显著效果。然而,它们存在以下问题:1)自注意力的高计算复杂度限制了可扩展性;2)在注意力计算中将云像素和干净像素均视为有效,会在后续层中引入干扰,导致性能次优。为解决这些挑战,我们提出了自适应三角变换器用于云去除(ATT-CR),该模型有效降低了计算成本并减轻了云像素的干扰。具体而言,它包含两个核心组件:三角注意力(TAN)和特征选择门控模块(FSGM)。TAN使用下三角和上三角矩阵近似Softmax注意力,计算复杂度为O(N),显著降低了计算成本。而FSGM与TAN集成,自适应地区分云特征和干净特征,从而最小化无效信息引入后续层。在云去除基准上的大量实验表明,ATT-CR相比现有方法具有更优的性能。

英文摘要

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

2606.05998 2026-06-05 cs.CV cs.AI

Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

基于深度学习的二维口内图像三维口腔重建

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种仅用十张二维口内图像进行三维口腔重建的软件方法,采用MobileNetV2与多头注意力机制,降低成本和不适,实现自动化重建。

Comments 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情
AI中文摘要

口腔三维建模是牙科中最关键的阶段之一,常用的方法如印模和口内扫描各有显著局限。印模法将藻酸盐或硅胶材料放入托盘并插入患者口腔形成阴模,存在患者不适、材料变形误差及存储运输困难等问题。口内扫描仪利用结构光或激光技术实时直接扫描口腔结构,效果先进但设备成本极高。为解决这些问题,本文提出一种基于软件的方法,仅使用从不同角度拍摄的十张二维口内图像重建三维口腔模型,无需专用硬件设备。该方法降低成本,消除物理扫描设备需求,减少患者不适,并实现自动化三维重建。模型在公开的Dental3DS数据集(包含950个上颌样本)上训练,采用MobileNetV2作为图像编码器,结合多头注意力进行多视图特征融合。所提模型在最近邻匹配(距离阈值0.035)下达到77.49%的准确率。然而,预测顶点倾向于集中在真实值的高密度区域,导致重建模型上的点分布不均匀。

英文摘要

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.

2606.05997 2026-06-05 cs.CV

Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

使用大语言模型和梯度提升的多模态性别歧视识别与表征

Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos

发表机构 * Artificial Intelligence and Learning Systems Laboratory(人工智能与学习系统实验室) School of Electrical and Computer Engineering(电气与计算机工程学院) National Technical University of Athens(雅典国家技术大学)

AI总结 提出基于特征工程和梯度提升回归模型的后融合管道,结合视觉、文本、人口统计、生物特征及LLM语义指标,用于识别和表征模因和短视频中的多模态性别歧视。

详情
AI中文摘要

我们介绍了AILS-NTUA提交给CLEF EXIST 2026实验室的工作,解决模因(任务2)和短视频(任务3)中的多模态性别歧视识别与表征问题。我们的系统采用基于特征工程的后融合管道,围绕梯度提升回归模型和层次化后处理构建。对于模因,我们结合了视觉、文本、人口统计、生物特征和LLM衍生的语义指标,旨在捕捉刻板印象、物化、讽刺和厌女等高层次线索。对于视频,我们研究了特征选择、基于帧的视觉表示、基于OCR的文本特征、声学描述符和传感器衍生元数据的影响。开发结果表明,聚焦的LLM衍生语义线索改善了模因性别歧视识别,而视频性能对特征维度和跨模态噪声高度敏感。对于视频,开发结果倾向于紧凑的特征选择,但官方测试结果表明这一结论不能完全推广到未见数据,其中未过滤的表征泛化更好。总体而言,我们的发现强调了针对静态模因进行目标语义特征工程的有用性,以及在嘈杂的短视频环境中需要更鲁棒的时间建模。

英文摘要

We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.

2606.05994 2026-06-05 cs.LG eess.SP

HoT-SSM:Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care

HoT-SSM:用于医疗保健的高阶时序知识图谱推理与状态空间模型

Thummaluru Siddartha Reddy, Vempalli Naga Sai Saketh, Yash Punjabi, Mahesh Chandran

发表机构 * Fujitsu Research of India, Bangalore(印度班加罗尔 Fujitsu 研究院)

AI总结 提出HoT-SSM模型,通过构建超图捕获高阶临床交互,并利用动态超图状态空间模型建模长程时序依赖,在MIMIC-III/IV数据集上显著提升临床预测性能。

Comments Paper under review

详情
AI中文摘要

融合临床知识的医学知识图谱(MKGs)越来越多地被用于建模电子健康记录(EHRs),以支持医疗领域的可解释预测。然而,现有的基于MKG的方法在捕获临床概念(如病情、手术和药物)之间的成对关系方面存在局限,限制了其建模共现或语义相关概念间高阶交互的能力。此外,大多数利用MKG的表示学习方法要么跨就诊折叠时间信息,要么缺乏显式建模长程时序依赖的机制,而这对于死亡率预测等临床任务至关重要。为缓解这些局限,我们提出HoT-SSM,一种参数高效的高阶时序图推理方法,结合状态空间模型。对于每次就诊,HoT-SSM通过利用领域知识将语义相关的临床概念分组为超边来构建超图,从而保留就诊级别的临床上下文。此外,为在学表示的同时建模时序动态,我们引入一种新颖的基于动态超图的状态空间模型,显式捕获患者潜在状态随时间演变,同时保留长程信息。学到的表示用于下游临床预测和推理。在MIMIC-III和MIMIC-IV数据集上的实验表明,性能显著优于当前最先进模型,证明了联合建模高阶临床交互和长程时序依赖的有效性。

英文摘要

Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in healthcare domain. However, existing MKG-based approaches are limited in capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), and restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which is critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter efficient and higher-order temporal graph reasoning with state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures patients latent state evolution over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on MIMIC-III and MIMIC-IV datasets shows significant performance improvement over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.

2606.05988 2026-06-05 cs.LG cs.CL

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

压缩-蒸馏:面向高效知识蒸馏的推理轨迹压缩

Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

发表机构 * Université catholique de Louvain(列日天主教大学) Sophont Inc(Sophont公司)

AI总结 本文提出在知识蒸馏前对推理轨迹进行事后压缩,以降低训练成本并缩短推理输出,实验表明压缩在准确率与效率间存在权衡。

详情
AI中文摘要

推理模型产生长的思维链轨迹,这些轨迹蒸馏成本高且鼓励学生输出冗长内容。我们研究在知识蒸馏前对这些轨迹进行事后压缩。两个教师模型,Qwen3.5-397B-A17B 和 gpt-oss-120B,各生成约 283k 条正确轨迹;两个指令调优模型将其压缩至原始字符长度的 8.6-21.0%。在包含 48 次运行的主网格和七次 Qwen 教师截断消融实验中,压缩轨迹将训练 token 减少至原始的 12-30%,训练速度提升 2.0-7.6 倍,推理输出缩短 3-19 倍,在更短的 gpt-oss 教师下减少幅度较小。然而,原始轨迹在每个规模下和两位教师上都保持最高的下游准确率。一项长度匹配的原始轨迹截断消融实验表明,压缩并非仅仅受益于更小的 token 预算:模型压缩的轨迹通常优于或匹配朴素截断,尤其是对于较小的学生模型,同时保持更短的推理输出。总体而言,推理轨迹压缩提供了准确率与效率之间的权衡,而非免费改进:学生模型保留了原始轨迹高达 96% 的准确率,同时获得了高达 18 倍的每 token 效率提升;在 0.8B 规模下,使用 LoRA 压缩轨迹缩小了原始与压缩之间的差距,但未超过原始轨迹。

英文摘要

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.

2606.05985 2026-06-05 cs.CL cs.CY

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

超越对齐:多元文化智能体系统中的价值多样性作为集体属性

Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Washington University in St. Louis(华盛顿大学圣路易斯分校)

AI总结 针对多元文化多智能体系统,提出以价值多样性作为系统级评估轴,通过文化条件化智能体在共享价值调查中的响应差异度量,发现多样性几乎与对齐无关,且当前系统远低于人类社会,混合骨干系统缩小但未消除差距,社会互动进一步侵蚀多样性。

详情
AI中文摘要

多元文化多智能体系统越来越多地部署在全球多样化的环境中,其中不同的智能体基于不同的文化背景。现有的文化评估侧重于价值对齐:单个智能体与目标文化的匹配程度。然而,对齐是每个智能体的属性,无法揭示系统作为一个整体是否保留了其旨在代表的文化多元性。我们提出价值多样性作为多元文化智能体系统的系统级评估轴,通过文化条件化智能体在共享价值调查上的响应差异来定义。利用世界价值观调查,我们评估了19种文化和18个骨干模型在广泛的系统配置下的表现。我们发现多样性在很大程度上与对齐无关,表明两者捕捉了互补的系统属性,并且当前的多元文化智能体系统在价值多样性上远低于人类社会。混合骨干系统缩小了这一差距但未消除,且该差距在文化组成和智能体规模上持续存在。社会互动进一步通过驱使智能体达成共识而侵蚀多样性,一个参与式预算案例研究表明,这种同质化缩小了集体决策的广度。总之,我们的结果将价值多样性确立为多元文化多智能体系统的一个独特评估轴,并揭示了当前基于LLM的社会中持续存在的同质化趋势。我们的代码和数据公开在 https://github.com/iNLP-Lab/MultiAgent-Diversity。

英文摘要

Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at https://github.com/iNLP-Lab/MultiAgent-Diversity.

2606.05983 2026-06-05 cs.AI cs.CL

Framing, Judging, Steering: An Assessable Competency Model for Teach-ing Students to Reason With Generative AI

框架构建、判断、引导:一种可评估的能力模型,用于教授学生与生成式AI进行推理

Alexander Apartsin, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍洛恩技术学院) Afeka College of Engineering(阿菲卡工程学院)

AI总结 提出CoRe-3能力模型,将有效使用AI分解为框架构建、判断和引导三种可评估技能,并通过模拟实验验证其区分效度。

Comments 18 pages, 4 pages

详情
AI中文摘要

生成式AI使答案变得容易而理解变得困难,不加批判的使用会导致认知卸载。学校仍然衡量无辅助的表现,但真正的任务是用AI产生好的工作:构建一个定义不明确的任务,判断输出,并引导模型获得更好的结果。这种能力很少被单独评估;即使被衡量,它也坍缩为一个单一的“提示”分数,无法诊断AI使用成功或失败的原因。我们提出CoRe-3(协同推理),一个能力模型,将生产性AI使用分解为三种可评估的技能,我们缩写为FJS:框架构建(在调用AI之前指定一个定义不明确的任务)、判断(评估输出中的错误和未声明的假设)和引导(迭代地重新引导模型)。其显著主张是将生成前的框架构建与生成后的引导分开,判断作为两者之间的门控。我们将这些技能建立在理论基础上,提出五个可检验的命题,并在CoReasoningLab中实例化它们,这是一个开放平台,呈现有缺陷的AI输出并独立评分。在模拟学习者(由不同模型生成和评分)上,这些技能是分离的:每个技能跟踪其自身的操纵能力,而其他技能保持不变,并且当一个能力在所有三个技能中共享时(收敛和区分效度),分数变得相关,评分后端来自两个提供商。接下来是人类评分者一致性和结果;我们发布工具、数据和协议。

英文摘要

Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.

2606.05981 2026-06-05 cs.CV cs.LG

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

基于视觉感知的多模态大语言模型条件编辑扩散的视频率流式风格化:蒸馏UNet + MLLM文本编码器上的非对称批处理推理

Yoshiyuki Ootani

发表机构 * Independent researcher(独立研究员)

AI总结 针对蒸馏扩散模型中文本编码器成为瓶颈的问题,提出一种结合非对称CUDA流水线、编译友好的ControlNet-LLLite重构和周期性条件刷新调度的流式管线,在消费级GPU上实现视频率实时风格化编辑。

Comments 12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at https://github.com/otanl/dreamlite-stream (also mirrored to Hugging Face and Zenodo)

详情
AI中文摘要

扩散U-Net的激进蒸馏反转了实时文本到图像流水线的逐帧瓶颈:一旦去噪器成为4步或1步蒸馏的学生模型,文本编码器就成为关键路径。这种反转在视觉感知编辑扩散中最为严重,其中编码器是多模态大语言模型(MLLM)。我们研究了一个0.39B蒸馏编辑U-Net与2.13B MLLM文本编码器(Qwen3-VL)配对的情况,并提出了一种针对该场景的流式管线,该管线围绕三种工程机制构建:非对称侧流/主流CUDA流水线,带有批处理文本编码器摊销(以及可选的静态提示缓存);一种编译友好的ControlNet-LLLite重构,将整个U-Net +适配器堆栈折叠成单个融合图;以及一个带有钩子子集的周期性条件刷新调度,用于摊销每帧条件成本。在单个消费级RTX 3090 Ti上,512x512分辨率下,管线在批大小B=8时维持27.4 fps,B=16时维持29.6 fps,端到端p50延迟分别约为0.5和1.0秒;相同操作点在RTX 4090上测得54.9 fps,在RTX 5090上测得74.1 fps。我们报告的是视频率流式吞吐量而非交互式低延迟,并将我们的数据与相同堆栈的StreamDiffusion重运行进行对比,作为系统上下文,而非基准优越性声明。对于训练的油画风格,发布的时序适配器在剪辑内噪声中泛化到19个未使用的DAVIS-2017序列和来自七个来源的15个非DAVIS剪辑;对未见风格族的提示级泛化有限,并单独报告。

英文摘要

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.

2606.05979 2026-06-05 cs.RO cs.AI

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

世界-语言-动作模型:统一世界建模、语言推理与动作合成

Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

发表机构 * SJTU(上海交通大学) SII(上海研究院) HUST(华中科技大学) SCUT(华南理工大学) ECUST(东华大学) SHU(上海大学) NJUPT(南京工业大学)

AI总结 提出世界-语言-动作(WLA)模型,通过自回归Transformer联合预测文本子任务、子目标图像和机器人动作,融合世界建模与语言推理能力,实现多任务和长时域任务的最优性能。

Comments 19 pages, 10 figures

详情
AI中文摘要

我们提出世界-语言-动作(WLA)模型作为一类新的具身基础模型。WLA以文本指令、图像和机器人状态为输入,联合预测文本子任务、子目标图像和机器人动作,结合了世界-动作模型(WAM)中从大量自我中心视频学习的世界建模接口,以及视觉-语言-动作(VLA)模型中解决复杂长时域任务的语言推理能力。WLA的核心是一个自回归(AR)Transformer主干,而非WAM中的双向扩散Transformer,用于预测下一状态,包括语义级别的文本意图和互补的细粒度物理动态。物理动态由基于专用世界专家的世界建模目标监督,并用于简化动作专家的状态-动作相关性表征。WLA利用元查询使世界预测隐式影响动作生成,从而在推理时可禁用世界预测。世界预测也可被激活以实现测试时缩放,从而改进机器人控制。我们的WLA-0原型具有2B活跃参数,在NVIDIA RTX 5090上每次推理耗时40毫秒。在模拟和真实环境中的评估表明,WLA-0实现了最先进的多任务和长时域学习能力,例如在RoboTwin2.0 Clean上成功率为92.94%,在RMBench上成功率为56.5%。WLA-0还有望直接从跨具身机器人视频中学习新任务,无需动作标注。

英文摘要

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the \emph{world modeling interface} to learn from extensive egocentric videos as in the world-action model (WAM) and the \emph{language reasoning} capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an \emph{autoregressive (AR)} Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the \emph{next state}, comprising the \emph{semantic-level} textual intention and complementary \emph{fine-grained} physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction \emph{implicitly} impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94\% success rate on RoboTwin2.0 Clean and 56.5\% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from \emph{cross-embodiment robot videos} without action annotations.

2606.05976 2026-06-05 cs.AI cs.CL

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

自我修正错觉:LLM 纠正他人但不纠正自己

Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

发表机构 * National Taiwan University(国立台湾大学)

AI总结 本文通过保持错误声明字节一致仅改变角色标签,发现 LLM 无法自我修正并非能力缺陷,而是聊天模板角色标签的人为产物,并提出无需训练或模型修改的提示结构干预方法。

详情
AI中文摘要

近期研究表明,LLM 智能体难以纠正自身推理轨迹中的错误,但当相同声明出现在外部来源时,其修正率显著更高。我们探究这种不对称性反映的是能力缺陷还是角色标签的人为产物:智能体纠正错误声明的意愿是否因果地依赖于承载该声明的聊天模板角色,而非声明内容本身?我们的实验设置在所有条件下保持错误声明的字节完全一致(SHA-256 验证),仅改变其包装角色:智能体自身的 \role{<thought>}、\role{user} 消息、\role{tool} 响应或 \role{system <memory>} 块。在覆盖七个模型家族和三个领域的 13 个模型-领域单元(每个单元 n=30 对任务)中,将声明从 \role{<thought>} 重新标记为外部角色后,显式修正率提升了 23 到 93 个百分点,其中 13 个单元中有 10 个达到 p<0.001。进一步实验证实该效应是不对称的、机制上可分解的,并且跨领域稳健。自我修正失败并非认知缺陷,而是聊天模板的人为产物。我们利用这一人为产物设计了一种仅涉及提示结构、无需训练和模型修改的干预方法,其最强角色标签依赖于领域:在数学上 \role{<memory>} 占主导,而在逻辑推理上普通 \role{user} 消息占主导。

英文摘要

Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own \role{<thought>}, a \role{user} message, a \role{tool} response, or a \role{system <memory>} block. Across 13 model-domain cells covering seven model families and three domains ($n{=}30$ paired tasks per cell), relabeling the claim from \role{<thought>} to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p{<}0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: \role{<memory>} dominates on math, while a plain \role{user} message dominates on logical deduction.

2606.05975 2026-06-05 cs.CV cs.RO

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

T-FunS3D:任务驱动的分层开放词汇3D功能分割

Jingkun Feng, Reza Sabzevari

发表机构 * P4MARS Lab at the Faculty of Aerospace Engineering, Delft University of Technology(代尔夫特理工大学航空航天工程学院P4MARS实验室)

AI总结 提出T-FunS3D方法,通过构建开放词汇场景图并利用视觉语言模型,实现任务驱动的分层3D功能分割,在保持性能的同时提升速度和降低内存消耗。

详情
AI中文摘要

开放词汇3D功能分割使机器人能够在3D场景中定位功能性物体组件。这是一项需要空间理解和任务解释的挑战性任务。当前的开放词汇3D分割方法主要关注物体级识别,而场景级部分分割方法试图详尽地分割整个场景,导致资源密集且耗时。在粒度、准确性和速度之间平衡分割性能仍然是一个挑战。作为缓解这一问题的一步,我们引入了T-FunS3D,一种任务驱动的分层开放词汇3D功能分割方法,为机器人应用提供可操作的感知。我们的方法以室内场景的3D点云和带姿态的RGB-D图像作为输入。通过提取环境中的实例及其视觉嵌入,我们构建了一个开放词汇场景图。给定任务描述,T-FunS3D识别场景图中最相关的实例,并利用视觉语言模型定位其功能组件。在SceneFun3D数据集上的实验表明,T-FunS3D在开放词汇3D功能分割方面与最先进方法相当,同时实现了更快的运行时间和更少的内存使用。

英文摘要

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

2606.05972 2026-06-05 cs.LG

LLM Explainability with Counterfactual Chains and Causal Graphs

基于反事实链和因果图的LLM可解释性

Nirit Nussbaum-Hoffer, Nitay Calderon, Liat Ein-Dor, Roi Reichart

发表机构 * Faculty of Data and Decision Sciences, Technion I IBM Research(数据与决策科学学院,技术离子IBM研究所)

AI总结 提出一种四阶段方法,利用因果图建模LLM推理过程,通过MCMC启发的反事实增强发现类判别性概念并生成可解释图,用于疾病诊断、情感分析等任务。

详情
AI中文摘要

因果图为使机制透明提供了高级语言。近期工作使用大型语言模型(LLMs)恢复外部世界过程的因果图。相反,在本文中,我们使用因果图对LLM推理本身进行建模,为利益相关者提供模型如何感知和组织高层概念以产生预测的透明视图。我们提出了一种四阶段方法来构建此类图。给定目标LLM和一组文本示例,我们的方法发现类判别性、人类可解释的概念,并将每个输入映射到LLM感知的概念状态。然后,我们引入一种受MCMC启发的反事实增强过程,通过反事实链扩展稀疏的观测数据。这使得使用$σ$-CG进行稳定的因果发现成为可能,从而产生信息丰富且可解释的图。我们将我们的方法应用于三个LLM,涵盖疾病诊断、情感分析和LLM作为评判者的分类任务。我们评估了学习到的图的预测保真度和结构稳定性,以及受MCMC启发的增强的收敛性和下游效用。我们的结果表明,发现的因果图捕获了与LLM推理一致的有意义的依赖关系。总之,本文为LLM的概念级可解释性提供了基础。

英文摘要

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with $σ$-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.