arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2606.19774 2026-06-19 cs.RO New submission

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection

Trong-Bao Ho, Quang-Tan Nguyen, Thien-Loc Ha, Gia-Binh Nguyen, Viet-Thanh Nguyen, Long Dinh, Minh N. Vu, Duy M. H. Nguyen, An Thai Le, Ngo Anh Vien

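Affiliations VinRobotics; VinUniversity; DFKI (German Research Center for Artificial Intelligence); University of Stuttgart; IMPRS-IS (International Max Planck Research School for Intelligent Systems)

AI summary PAINT is a training-free method for the chunk-boundary inconsistency that arises when flow-based policies execute asynchronously. It achieves prefix consistency by selecting the initial noise rather than steering the trajectory, and improves execution consistency and task performance on 12 simulated and 6 real-world manipulation tasks.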

Comments First version 19 pages, project site: https://paint-action-chunking.github.io

Abstract

Action chunking enables robot policies to produce temporally coherent behavior, but generating multi-step action sequences with flow-based policies incurs latency that is incompatible with real-time control. Under asynchronous execution, the robot continues executing the current chunk while the next one is generated, causing even minor delays to create inconsistencies at chunk boundaries. Existing methods address this problem by steering generation toward the already executed action prefix. We instead show that prefix consistency can be achieved by selecting an appropriate initial noise before generation begins, allowing the unmodified flow ODE to produce a coherent next chunk. This reframes asynchronous inference as a noise selection problem rather than a trajectory steering problem. We introduce PAINT, a training-free method that finds this noise via backward Euler inversion and constructs the final chunk through a repainting rule. In summary, PAINT requires no gradients, retraining, or policy modification; yet it improves execution consistency and task performance across 12 simulated benchmarks and 6 real-world manipulation tasks spanning single-arm, bimanual, and humanoid embodiments. Website: https://paint-action-chunking.github.io
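
As a rough illustration of the noise-selection idea in this abstract, the sketch below inverts a fixed-step Euler flow solver on a toy scalar problem. It is not the PAINT algorithm: the linear field `velocity`, the step count, and the fixed-point solver are assumptions made for illustration, and the repainting rule is not modeled.

```python
def velocity(x, t):
    # Hypothetical toy velocity field standing in for a learned flow policy.
    return -0.5 * x + 1.0 + 0.3 * t


def integrate(x0, steps=10):
    # Forward Euler integration of dx/dt = velocity(x, t) from t=0 to t=1.
    h = 1.0 / steps
    x = x0
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x


def invert(x1, steps=10, iters=50):
    # Walk backwards from t=1 to t=0. Each step solves the implicit
    # equation x_k = x_{k+1} - h * velocity(x_k, t_k) by fixed-point
    # iteration, which undoes one forward Euler step.
    h = 1.0 / steps
    x = x1
    for k in reversed(range(steps)):
        x_prev = x
        for _ in range(iters):
            x_prev = x - h * velocity(x_prev, k * h)
        x = x_prev
    return x


target = 1.7                      # value the generated action should match
noise = invert(target)            # initial noise selected by inversion
reconstructed = integrate(noise)  # unmodified forward solve from that noise
```

In this toy setting the forward solver is left untouched and only its starting point changes, which is the contrast with steering that the abstract draws.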

2606.19771 2026-06-19 cs.AI New submission

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

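Affiliations The Hong Kong University of Science and Technology; Sichuan University; The Hong Kong Polytechnic University

AI summary To address the entropy collapse or explosion caused by token updates in RLVR, the ICT framework uses JS divergence to identify critical tokens and balances policy concentration through selective updates, improving reasoning performance.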

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
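
To make the selection step concrete, here is a minimal sketch of ranking token positions by Jensen-Shannon divergence and keeping the top 10%. The abstract does not say which distributions ICT compares, so the scoring rule below (mean JS divergence to every other position) and the function names are assumptions for illustration only.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    # Kullback-Leibler divergence in nats; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric and bounded by log(2) nats.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def select_distinctive(logit_rows, fraction=0.10):
    # Score each position by its mean JS divergence to all other positions
    # (an assumed scoring rule), then keep the top `fraction` of positions.
    probs = [softmax(row) for row in logit_rows]
    n = len(probs)
    scores = [
        sum(js_divergence(probs[i], probs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    keep = max(1, round(fraction * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

A training loop could then mask the policy-gradient loss to the returned positions. That masking step is likewise an assumption about how selective updating might be wired in.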

2606.19770 2026-06-19 cs.LG New submission

An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling

Itsuki Nakagawa, Kenji Yamanishi

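Affiliations Graduate School of Information Science and Technology, The University of Tokyo

AI summary An information-theoretic framework that uses latent mixture modeling and description-length constraints to generate novel graph data that differ from existing patterns while preserving global structural consistency.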

Abstract

We propose an information-theoretic framework for graph novelty generation, which aims to generate data that are distinct from existing patterns while preserving global structural consistency. Our approach embeds data into a latent space, models the latent distribution using finite mixture models, and generates novel samples by imposing explicit novelty and reliability conditions formulated in terms of description length. Specifically, novelty is enforced by requiring generated samples to be poorly explained by all existing mixture components, while reliability constrains their impact on the overall mixture structure under the Minimum Description Length (MDL) principle. We provide a theoretical analysis showing that, with appropriate threshold choices, the probabilities of misclassifying non-novel or unreliable samples converge to zero with explicit rates. Experiments on synthetic and benchmark graph datasets demonstrate that the proposed method enables principled novelty generation with quantifiable risk.
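
The novelty condition can be pictured with a small sketch in which a latent sample counts as novel only if every mixture component needs a long code to describe it. The 1-D Gaussian components, the threshold value, and the function names are hypothetical. The reliability condition and the paper's actual description-length criteria are not modeled here.

```python
import math


def code_length(x, mean, std):
    # Description length in nats of x under a 1-D Gaussian: -log p(x).
    return 0.5 * math.log(2 * math.pi * std ** 2) + (x - mean) ** 2 / (2 * std ** 2)


def is_novel(x, components, threshold):
    # Novel means every existing component needs more than `threshold` nats.
    return all(code_length(x, m, s) > threshold for m, s in components)


components = [(0.0, 1.0), (5.0, 1.0)]  # hypothetical (mean, std) pairs
threshold = 4.0                        # hypothetical novelty threshold in nats
```

With these values a point midway between the two components is flagged as novel, while a point close to either mean is not.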

2606.19759 2026-06-19 cs.AI cs.SI New submission

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

Rohit Negi, Mustafa Yilmaz

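Affiliations Carnegie Mellon University

AI summary For a question-answering forum of knowledge workers, the paper proposes a request-scheduling model based on experts' expertise levels, computes the system capacity, designs capacity-achieving schedulers, and examines how expert collaboration can increase capacity.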

Comments 14 pages, 4 figures

Abstract

As individuals turn to the Internet to find answers to questions they may have, several Question Answering (QA) forums have evolved, where users knowledgeable in certain topics can contribute their expertise to answering these requests for information. While these are currently volunteer based, we consider a future version employing knowledge workers who are experts in certain topics. In such a system, the request-answer processes forming the queuing system may utilize schedulers that assign requests in different topics to the experts in the forum, who may be able to answer them according to their expertise levels in different topics. With this model, we calculate the capacity of the system for handling the requests while keeping the system stable, and design schedulers that achieve capacity. We also investigate how collaboration between experts in answering requests can potentially increase capacity.

2606.19752 2026-06-19 cs.RO cs.AI New submission

Temporal Self-Imitation Learning

Yinsen Jia, Boyuan Chen

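Affiliations Duke University

AI summary Temporal Self-Imitation Learning mines temporally efficient successful trajectories and converts them into reusable supervision, improving learning efficiency and robustness on long-horizon robot manipulation tasks.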

Abstract

Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.

2606.19750 2026-06-19 cs.LG cs.AI cs.CL New submission

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

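Affiliations University of California, San Diego

AI summary The Bayesian Manifold Curriculum (BMC) framework models problem sampling as a manifold-structured bandit problem and guides sampling with a hierarchical task tree and Bayesian learning, balancing learning signal, diversity, and utility.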

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

Abstract

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.
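
One way to picture Bayesian sampling over a task tree is Thompson sampling with a Beta posterior at each node, sketched below. This is a generic swapped-in technique, not BMC itself: the tree, the `signal_rate` values, and the binary "useful signal" reward are hypothetical, and the latent-geometry and non-stationarity aspects described in the abstract are not modeled.

```python
import random


class Node:
    # A node in a hypothetical task tree with a Beta(alpha, beta) posterior
    # over how often sampling below it yields a useful learning signal.
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.alpha = 1.0
        self.beta = 1.0


def sample_path(root, rng):
    # Thompson sampling at each level: draw from every child's posterior
    # and descend into the child with the largest draw.
    path = [root]
    while path[-1].children:
        draws = [rng.betavariate(c.alpha, c.beta) for c in path[-1].children]
        path.append(path[-1].children[draws.index(max(draws))])
    return path


def update(path, useful):
    # Propagate the binary outcome to every node on the sampled path.
    for node in path:
        if useful:
            node.alpha += 1.0
        else:
            node.beta += 1.0


rng = random.Random(0)
algebra = Node("algebra", [Node("linear"), Node("quadratic")])
tree = Node("root", [algebra, Node("geometry", [Node("angles")])])
signal_rate = {"linear": 0.8, "quadratic": 0.7, "angles": 0.1}  # hypothetical
visits = {name: 0 for name in signal_rate}
for _ in range(500):
    path = sample_path(tree, rng)
    leaf = path[-1]
    visits[leaf.name] += 1
    update(path, rng.random() < signal_rate[leaf.name])
```

A sampler like this optimizes only the learning-signal axis. The abstract argues that diversity and evaluation relevance also matter, so a fuller version would need additional terms.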

2606.19749 2026-06-19 cs.AI cs.CL New submission

Benchmarking Agentic Review Systems

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

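Affiliations University of Chicago; Bar-Ilan University

AI summary Agentic review systems are emerging in response to the pressure AI-assisted research places on peer review, but evaluation standards are lacking. The paper evaluates several systems and finds that the best configuration (OpenAIReview + GPT-5.5) reaches 83.0% pairwise accuracy, catches 71.6% of injected errors, and receives positive user feedback.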

Comments 11 pages, 7 tables, 4 figures

Abstract

A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track paper quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
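
The two headline metrics can be sketched as follows. The exact definitions used in the paper are not given in the abstract, so the tie handling in `pairwise_accuracy` and the set-based `recall` are assumptions, and the inputs in the usage note are made up.

```python
from itertools import combinations


def pairwise_accuracy(scores, quality):
    # Fraction of paper pairs with different quality labels whose review
    # scores are ordered the same way; tied review scores count as misses.
    agree = total = 0
    for i, j in combinations(range(len(scores)), 2):
        if quality[i] == quality[j]:
            continue
        total += 1
        if (scores[i] - scores[j]) * (quality[i] - quality[j]) > 0:
            agree += 1
    return agree / total


def recall(detected, injected):
    # Share of injected errors that appear among the detections.
    return len(set(detected) & set(injected)) / len(set(injected))
```

For example, with injected errors `{"e1", "e2", "e3", "e4"}`, one model detecting `{"e1", "e2"}` and another detecting `{"e2", "e3"}` each reach 0.5 recall, while their union reaches 0.75. This is the kind of gap between single-model and union recall that the abstract reports.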

2606.19747 2026-06-19 cs.AI New submission

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

Nabil Mosharraf Hossain, Riasat Islam, Unaizah Obaidellah

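Affiliations Greentech Apps Foundation; Queen Mary University of London; University of Malaya

AI summary A systematic comparison of fine-tuning pretrained Transformer models such as Wav2Vec2.0, HuBERT, and XLS-R for Quranic speech recognition. In experiments on a dataset of more than 870 hours, the best configuration achieves a word error rate of 0.08, and training time falls from 140 hours to 40 hours.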

Comments 30 pages, 9 figures, 5 tables, Submitted to International Journal of Speech Technology

Abstract

Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rates (WER) on user-recited verses and lack full coverage of the Quranic corpus. This paper presents a systematic empirical study of domain-specific fine-tuning of pretrained Transformer-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2.0, HuBERT, and XLS-R. These models apply self-supervised learning by masking portions of input audio and using Transformer architectures to learn context-aware speech features. The pretrained models are fine-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain. Our best-performing configuration achieves a WER of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting, representing roughly a five-percentage-point gain over the Citrinet baseline (WER = 0.163) while reducing combined-model training time from 140 hours to 40 hours. Arabic text without diacritics yields the best fine-tuning results, and Wav2Vec2-XLSR-53 provides the strongest overall representation. Future work includes improving dataset quality and developing phoneme-aware models to extract deeper speech feature representations for Tajweed-sensitive applications.
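
For readers unfamiliar with the metric, the sketch below computes word error rate with a word-level edit distance and checks the "roughly five percentage points" figure from the two WER values quoted in the abstract. Whitespace tokenisation is an assumption, and the paper's text normalisation (for example, diacritic handling) is not reproduced.

```python
def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / reference word count,
    # computed with a word-level Levenshtein distance.
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Absolute gap between the reported baseline and the combined-setting result.
gain = 0.163 - 0.11
```

The gap works out to about 0.053, that is, 5.3 percentage points of absolute WER between the Citrinet baseline and the combined EveryAyah+Tarteel setting.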

2606.19744 2026-06-19 cs.CL cs.AI cs.HC New submission

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Affiliations Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

AI summary A study of sequential DPO across preference settings finds that forgetting is not uniform but depends on objective relationship, signal strength, and training order, and recommends that future alignment pipelines account for objective compatibility.

Comments Submitted to EMNLP 2026

Abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
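
Two of the diagnostics named in this abstract can be sketched in a few lines: a length-normalised margin for one preference pair, and the cosine similarity used to judge whether two update directions are near-orthogonal. The paper's exact margin definition (for instance, whether it subtracts the reference model's log-probabilities) is not stated in the abstract, so the form below is an assumption.

```python
import math


def length_normalised_margin(logp_chosen, len_chosen, logp_rejected, len_rejected):
    # Per-token log-probability of the chosen response minus that of the
    # rejected one; positive values mean the policy prefers the chosen response.
    return logp_chosen / len_chosen - logp_rejected / len_rejected


def cosine_similarity(u, v):
    # Values near 0 indicate near-orthogonal update directions.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

Tracking the margin of each pair before and after a later training stage gives the pair-level view that, according to the abstract, aggregate metrics can hide.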

2606.19741 2026-06-19 cs.AI cs.LG New submission

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

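Affiliations Carnegie Mellon University; Nanyang Technological University; Microsoft Research; Massachusetts Institute of Technology

AI summary The Evolving Programmatic Bottlenecks (EPB) framework distills black-box neural combinatorial optimization models into readable program portfolios, using an LLM and hybrid gradient descent for interpretability, and reveals how model behavior relates to variants of classic heuristics.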

Comments Under Review

Abstract

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19736 2026-06-19 cs.CV New submission

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

Affiliations State Key Laboratory of Intelligent Vehicle Safety Technology; School of Cyber Science and Engineering, Huazhong University of Science and Technology; School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; Hebei Energy College of Vocation And Technology

AI summary An end-to-end framework that combines UV-volume rendering with a diffusion-based texture generator, adds an illumination color consistency estimator and a multi-scale dynamic training strategy, and generates wearable adversarial patterns that sustain stable physical attacks under the changing viewpoints and lighting of settings such as UAV reconnaissance.

Comments Accepted by ICME 2026

Abstract

Adversarial camouflage in the physical world remains highly challenging, particularly under UAV reconnaissance where targets undergo continuous geometric changes and extreme illumination variations. Existing methods either optimize 2D digital perturbations that fail to generalize to dynamic viewpoints or produce visually unnatural textures that cannot be deployed in real scenarios. Therefore, we propose an end-to-end framework for adversarial camouflage generation that automatically produces wearable adversarial patterns and maintains stable attack performance in real physical environments with changing viewpoints, poses, and lighting conditions. Our method integrates UV-volume rendering with a diffusion-based texture generator, enabling consistent appearance under varying scales, poses, and lighting conditions. To ensure environmental realism, we propose an illumination color consistency estimator that extracts dominant background attributes and guides a natural texture loss to align the generated UV texture with the surrounding environment. A multi-scale dynamic training strategy further enhances robustness against viewpoint shifts and body deformation. Extensive experiments across multiple mainstream detectors demonstrate that our method achieves strong and stable physical attack performance while maintaining high perceptual naturalness, reducing human detection rates without introducing unnatural artifacts.

2606.19735 2026-06-19 cs.AI cs.CV New submission

GLARE: A Natural Language Interface for Querying Global Explanations

Bhavan Vasu, Rajesh Mangannavar

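Affiliations Oregon State University

AI summary GLARE is an LLM-based interactive interface that translates natural language questions into SQL queries that aggregate local explanation data, improving the accessibility and usability of global explanations.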

Comments 16 pages, 2 figures

Abstract

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
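
The question-to-SQL step can be illustrated with an in-memory SQLite table of local explanations. Everything specific here is hypothetical: the `local_explanations` schema, the sample rows, and the query itself, which is hand-written to stand in for what an LLM mediator might emit. The abstract does not describe GLARE's actual schema or prompts.

```python
import sqlite3

# Hypothetical table of local explanations: one row per (image, concept)
# attribution produced for a black-box classifier.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_explanations "
    "(image_id INTEGER, predicted_class TEXT, concept TEXT, attribution REAL)"
)
conn.executemany(
    "INSERT INTO local_explanations VALUES (?, ?, ?, ?)",
    [
        (1, "zebra", "stripes", 0.9),
        (1, "zebra", "grass", 0.2),
        (2, "zebra", "stripes", 0.7),
        (3, "horse", "grass", 0.5),
    ],
)

# SQL that a mediator might emit for the question
# "Which concepts matter most for the class zebra?"
query = (
    "SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n "
    "FROM local_explanations WHERE predicted_class = ? "
    "GROUP BY concept ORDER BY mean_attr DESC"
)
rows = conn.execute(query, ("zebra",)).fetchall()
```

Aggregating per-image rows in this way turns local explanations into a class-level answer without the user ever seeing the underlying table.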

2606.19734 2026-06-19 cs.LG New submission

Federated Bilevel Performative Prediction

Liangxin Qian, Chang Liu, Xuanyu Cao, Jun Zhao, Kwok-Yan Lam

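Affiliations Nanyang Technological University; Zhejiang University; Washington State University

AI summary A study of bilevel optimization in federated learning where client data distributions depend on the deployed decisions. It proposes the federated bilevel performatively stable point and two methods for computing it, and experiments validate the stability thresholds and show improved meta-generalization.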

Comments Accepted by ICML 2026

Abstract

Federated bilevel optimization is widely used for nested learning problems across distributed clients, such as federated hyperparameter tuning and meta-learning under privacy and communication constraints. Most existing formulations assume fixed client data distributions, which can be violated by performativity, where deployed decisions reshape client behavior and data collection, inducing client-specific, decision-dependent distribution shift. We study federated bilevel performative prediction, where both upper-level (UL) and lower-level (LL) objectives are evaluated under client-dependent, decision-dependent distributions. We formalize the federated bilevel performatively stable (FBPS) point under a decoupled-risk perspective and provide sufficient conditions for its existence and uniqueness. We then develop two federated methods to compute the FBPS solution: FBi-RRM, which converges linearly under a contraction condition, and FBi-SGD, a communication-efficient stochastic method based on federated hypergradient estimation with convergence guarantees under diminishing step sizes when sensitivities are sufficiently small. Experiments on strategic regression and meta strategic classification validate the predicted stability thresholds and demonstrate improved meta-generalization over non-performative baselines, and CNN-based classification further demonstrates the practical effectiveness of the proposed methods in nonconvex neural network settings.
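
Linear convergence to a performatively stable point under a contraction condition can be seen in a deliberately simplified, single-level toy problem. The sketch below is not FBi-RRM: it drops the bilevel structure, uses scalar squared loss, and assumes each client's data mean shifts linearly with the deployed decision. The client parameters are hypothetical.

```python
def rrm_step(theta, clients):
    # Each client's data mean shifts with the deployed decision theta:
    # mean_i = base_i + sensitivity_i * theta. For squared loss, the
    # minimiser of the averaged client risk on that induced data is the
    # average of the client means.
    return sum(base + sens * theta for base, sens in clients) / len(clients)


def run_rrm(clients, theta=0.0, rounds=60):
    # Repeated risk minimisation: retrain on the data the last model induced.
    history = [theta]
    for _ in range(rounds):
        theta = rrm_step(theta, clients)
        history.append(theta)
    return history


clients = [(1.0, 0.2), (3.0, 0.4)]  # hypothetical (base mean, sensitivity)
history = run_rrm(clients)
# Fixed point of theta = 2.0 + 0.3 * theta.
stable_point = 2.0 / (1.0 - 0.3)
```

Here the update map is a contraction with factor 0.3, the average sensitivity, so the error shrinks by that factor every round. If the average sensitivity reached 1 or more, the iteration would no longer converge, which is the kind of stability threshold the abstract refers to.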

2606.19733 2026-06-19 cs.CV cs.AI New submission

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang

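Affiliations University of Chinese Academy of Sciences; State Key Laboratory of Communication Content Cognition; Peng Cheng Laboratory

AI summary QueryGaussian is a training-free framework for open-vocabulary 3D instance retrieval. An instance-level query mechanism decouples semantics from geometry and is combined with 2D vision models and a temporal fusion module, cutting GPU memory by more than 70% and speeding up inference 180x while maintaining accuracy, with support for city-scale scenes.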

Comments 8 pages, 4 figures, 6 tables. Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

Abstract

Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

2606.19729 2026-06-19 cs.RO cs.AI 新提交

VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents

VOiLA: 面向POMDP智能体的基于学习扩散模型的向量化在线规划

Marcus Hoerger, Rishikesh Joshi, Rahul Shome, Ian Manchester, Hanna Kurniawati

发表机构 * Australian National University(澳大利亚国立大学) The University of Sydney(悉尼大学)

AI总结 提出VOiLA框架,利用条件扩散模型学习POMDP模型,通过蒸馏加速采样并与向量化在线规划器集成,在三个基准任务和实物机器人上实现高效在线规划。

Comments Submitted to the 2026 International Symposium of Robotics Research (ISRR)

详情
AI中文摘要

不确定性下的规划是自主机器人的关键能力。部分可观测马尔可夫决策过程(POMDP)为此提供了强大框架。尽管基于POMDP的规划已取得显著进展,但其在现实问题中的应用常受限于难以获得准确的POMDP模型。我们提出VOiLA(Vectorized Online planning wIth Learned diffusion model for POMDP Agents),一个学习任务无关POMDP模型以实现在不确定性下在线规划的框架。VOiLA使用条件扩散模型学习转移和观测采样器,并学习用于基于粒子的信念更新的观测似然模型。为实现高效在线规划,扩散采样器被蒸馏为紧凑的前馈生成器,并与VOPP(一种利用GPU并行化的在线POMDP规划器)集成。实验结果表明,蒸馏策略最多可将采样成本降低近三个数量级,使学习到的生成式POMDP模型可实际用于在线规划。在三个基准问题上的评估表明,VOiLA在使用不到10%训练数据的情况下,性能达到或优于循环软演员-评论家算法(Recurrent Soft Actor Critic),并且对未见环境配置的泛化能力明显更强。实物机器人评估表明,VOiLA使用仅由模拟数据学习到的模型,生成的策略在10次运行中全部成功完成任务。

英文摘要

Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with the Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate that the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicates that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% of the training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates that VOiLA, using models learned only from simulated data, generates a policy that successfully accomplishes the task in 10 of 10 runs.
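
The particle-based belief update mentioned in the abstract can be pictured with a generic bootstrap particle filter step. The sketch below is not VOiLA's implementation: `toy_transition` and `toy_likelihood` are hypothetical 1-D stand-ins for the learned transition sampler and observation-likelihood model, and every name and noise level is an assumption made for illustration.

```python
import numpy as np


def update_belief(particles, action, observation,
                  sample_transition, obs_likelihood, rng):
    """One bootstrap particle-filter step: propagate, weight, resample."""
    # Propagate every particle through the transition sampler.
    propagated = sample_transition(particles, action, rng)
    # Weight each particle by how well it explains the observation.
    weights = obs_likelihood(observation, propagated, action)
    total = weights.sum()
    if total <= 0.0:
        # Degenerate case: no particle explains the observation.
        weights = np.full(len(propagated), 1.0 / len(propagated))
    else:
        weights = weights / total
    # Resample with replacement in proportion to the weights.
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]


# Hypothetical stand-ins for the learned models (assumptions, not VOiLA's).
def toy_transition(states, action, rng):
    return states + action + rng.normal(0.0, 0.1, size=states.shape)


def toy_likelihood(obs, states, action):
    return np.exp(-0.5 * ((obs - states) / 0.2) ** 2)


rng = np.random.default_rng(0)
belief = rng.uniform(-5.0, 5.0, size=2000)   # broad prior belief
belief = update_belief(belief, 1.0, 2.0, toy_transition, toy_likelihood, rng)
```

After one update the particle set concentrates around the observed value; a planner would then sample states from `belief` when simulating future actions.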

2606.19728 2026-06-19 cs.RO cs.AI 新提交

Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning

机器人发展性运动学习的双向辅导:共同发展的交互动力学支持稳定学习

Rui Fukushima, Jun Tani

发表机构 * Okinawa Institute of Science and Technology Graduate University(冲绳科学技术大学院大学)

AI总结 提出双向辅导框架,通过人类或AI导师与机器人动态适应,利用自由能原理神经网络实现稳定序列学习,在物体操作任务中验证了行为一致性和泛化能力。

Comments 16 pages, 14 figures

详情
AI中文摘要

众所周知,婴儿通过与照顾者的密集互动来发展运动技能。尽管这种社会互动对人类发展至关重要,但机器人的运动技能学习通常被视为单向过程,机器人被动接受导师的演示。这忽视了社会互动的一个关键特性:它本质上是双向的,导师和学习者相互动态适应。在这种互动中,机器人的过往经验可能作为先验约束,塑造共同发展轨迹的动态。我们假设双向辅导允许这些约束引导形成一致的行为模式,从而保持行为一致性并支持泛化,而单向互动缺乏此类约束,导致更广泛、更不一致的行为模式。为检验这一假设,我们使用实体人形机器人进行了两个物体操作实验:一个涉及人机互动,另一个采用AI导师通过自适应干预机制与真实机器人互动,以检验在更受控条件下是否会出现类似效果。我们使用基于自由能原理的神经网络并扩展生成回放来实现发展性学习框架,该框架支持从单个辅导情节中进行稳定的逐序列学习。在两种设置中,双向辅导促进了行为一致性和阶段性泛化,同时机器人逐渐需要更少的导师指导。这些结果表明,双向辅导作为一种具身和社会化方法,为机器人的发展性运动学习提供了有效支架。

英文摘要

Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot's past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypothesize that bidirectional tutoring allows such constraints to guide the formation of consistent behavioral patterns that preserve behavioral coherence and support generalization, whereas unidirectional interaction lacks such constraints and leads to broader, less consistent behavioral patterns. To examine this hypothesis, we conducted two experiments with a physical humanoid robot performing an object manipulation task: one involving human-robot interaction and another employing an AI tutor interacting with the real robot through an adaptive intervention mechanism designed to examine whether similar effects would emerge under more controlled conditions. We implement the developmental learning framework using a free-energy-principle-based neural network extended with generative replay, which supports stable sequence-by-sequence learning from single tutored episodes. Across both settings, bidirectional tutoring fostered consistent behaviors and stage-wise generalization, while the robot gradually required less tutor guidance. These results suggest that bidirectional tutoring, as an embodied and socially grounded approach, provides an effective scaffold for developmental motor learning in robots.

2606.19727 2026-06-19 cs.CL cs.AI 新提交

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM:语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

发表机构 * Shenzhen Technology University(深圳技术大学) New Delhi Institute of Management(新德里管理学院) Technische Universität Dresden(德累斯顿工业大学) Ramakrishna Mission Vivekananda Educational and Research Institute(罗摩克里希纳传道会维韦卡南达教育与研究学院) Indian Institute of Technology(印度理工学院) Swami Vivekananda Institute of Technology(斯瓦米·维韦卡南达技术学院) GuangDong Engineering Technology Research Center of Edge Intelligence(广东省边缘智能工程技术研究中心)

AI总结 提出NRITYAM基准,包含9,260个跨12语言的文化问答对,评估语言模型对全球舞蹈传统的文化理解能力,涵盖多种模型类型。

Comments 18 pages, 12 figures, in ECML_PKDD'26

详情
AI中文摘要

语言模型已成为塑造现代工作流程的重要工具。然而,其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距,我们提出NRITYAM,一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对,涵盖12种语言,是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发,他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型,包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准,NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在 https://github.com/niladrighosh03/NRITYAM 获取。

英文摘要

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.

2606.19721 2026-06-19 cs.LG cs.AI 新提交

OnDeFog: Online Decision Transformer under Frame Dropping

OnDeFog:帧丢失下的Online Decision Transformer

Daiki Yotsufuji, Kenta Nishihara, Shoma Shimizu, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

AI总结 针对帧丢失导致性能下降的问题,提出OnDeFog,将DeFog机制与Online Decision Transformer(ODT)结合,通过直接环境交互学习策略,在高丢帧率环境下优于ODT,在低奖励数据集上优于DeFog。

Comments Accepted to PRICAI 2025

详情
AI中文摘要

在具有挑战性的现实世界强化学习应用中,通信延迟或传感器故障经常导致帧丢失,此时智能体无法接收丢失的状态及相关奖励。为了解决帧丢失导致的性能下降问题,通过将额外机制引入Decision Transformer以处理帧丢失,开发了随机帧丢失下的Decision Transformer(DeFog)。尽管DeFog可以缓解帧丢失环境中的性能下降,但由于DeFog是一种离线学习方法,它难以有效泛化到训练数据集中未充分表示的新状态。在本研究中,我们提出OnDeFog,它将DeFog中的机制与Online Decision Transformer(ODT)相结合,ODT是一种通过直接环境交互学习策略的在线强化学习方法。全面的实验评估表明,我们提出的OnDeFog在高丢帧率环境下相比ODT取得了更优的性能,并且在包含大量低奖励数据的数据集上优于DeFog。

英文摘要

In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the performance degradation caused by frame dropping, the Decision Transformer under Random Frame Dropping (DeFog) was developed by incorporating additional mechanisms into the Decision Transformer. Although DeFog can mitigate performance degradation in frame-dropping environments, it is an offline learning method and therefore struggles to generalize to novel states that are not adequately represented in the training dataset. In this study, we propose OnDeFog, which integrates the mechanisms of DeFog with the Online Decision Transformer (ODT), an online reinforcement learning method that learns policies through direct environmental interaction. Comprehensive experimental evaluation demonstrates that OnDeFog outperforms ODT in environments with high frame-dropping rates and outperforms DeFog on datasets containing a large amount of low-reward data.
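
The frame-dropping setting itself can be simulated in a few lines. The sketch below only illustrates the problem described in the abstract, not the DeFog or OnDeFog mechanisms: the class name `FrameDropper`, its `receive` method, and the idea of exposing a drop counter to the agent are all assumptions made for illustration.

```python
import random


class FrameDropper:
    """Simulates a lossy channel between the environment and the agent.

    With probability `drop_rate` the newest (state, reward) pair is lost
    and the agent keeps seeing the last pair that got through.
    """

    def __init__(self, drop_rate, seed=0):
        if not 0.0 <= drop_rate <= 1.0:
            raise ValueError("drop_rate must lie in [0, 1]")
        self.drop_rate = drop_rate
        self.rng = random.Random(seed)
        self.last = None      # last frame that reached the agent
        self.drop_span = 0    # consecutive frames dropped since then

    def receive(self, state, reward):
        # The very first frame is always delivered so `last` is defined.
        dropped = (self.last is not None
                   and self.rng.random() < self.drop_rate)
        if dropped:
            self.drop_span += 1
        else:
            self.last = (state, reward)
            self.drop_span = 0
        return self.last[0], self.last[1], self.drop_span


always_drop = FrameDropper(drop_rate=1.0)
first = always_drop.receive("s0", 0.0)
second = always_drop.receive("s1", 1.0)   # dropped: agent still sees s0
never_drop = FrameDropper(drop_rate=0.0)
fresh = never_drop.receive("s1", 1.0)
```

Wrapping an environment's step output in such a channel is one simple way to evaluate a policy at different drop rates.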

2606.19718 2026-06-19 cs.CV 新提交

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

基于3D先验引导扩散模型的单样本新视角与姿态人体图像合成

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院教育部高维信息智能感知与系统重点实验室、江苏省社会安全图像与视频理解重点实验室及PCA实验室) Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center(国防科技大学电子工程学院安徽省先进激光技术实验室及江淮前沿技术中心)

AI总结 提出一种基于条件去噪扩散模型的方法,利用3D人体先验(法线图和颜色提示)作为几何和颜色条件,从单张参考图像合成任意姿态和视角的高质量人体图像,包括被遮挡部分。

Comments 30 pages, 10 figures

详情
AI中文摘要

本文解决了单样本新视角和姿态人体图像合成的挑战。现有方法通过一组2D姿态关键点将参考人体图像转移到目标姿态,或基于可泛化人体NeRF(使用人体模型先验提取逐点特征)合成人体图像。然而,基于姿态转移的方法无法处理使用模糊2D姿态作为条件的复杂人体姿态,而可泛化人体NeRF在缺乏可靠特征时可能无法准确恢复被遮挡/不可见的人体部分。为解决这些问题,我们提出了一种基于条件去噪扩散模型的新方法,用于从单张人体图像进行新视角和姿态合成。我们的扩散模型将新视角和姿态合成问题分解为一系列条件去噪步骤。具体而言,为了生成具有复杂和任意姿态的人体,我们将3D人体先验(即3D法线图和颜色提示)作为几何和颜色条件引入生成过程。通过一系列扩散步骤将参考人体转移到目标人体,我们的扩散模型能够实现高质量合成,包括被遮挡/不可见部分。此外,我们提出了一种基于自重建的自定义细化方法,以在新人物上测试时增强细节。在多个公共数据集上的实验结果表明,我们的方法显著优于先前方法,并显示出更好的跨数据集泛化能力。代码将在 https://github.com/Yankeegsj/3DPGDM 上公开。

英文摘要

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints, or synthesize human images with a generalizable human NeRF that uses human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may fail to accurately recover occluded/invisible human parts when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., a 3D normal map and a color prompt, as geometry and color conditions in the generation process. By transferring the reference human into the target human through a series of diffusion steps, our diffusion model enables high-quality synthesis, including the occluded/invisible parts. Further, we propose a self-reconstruction-based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.

2606.19712 2026-06-19 cs.LG cs.CV 新提交

Efficient Neural Network Model Selection for Few-Class Application Datasets

面向少类应用数据集的高效神经网络模型选择

Bryan Bo Cao, Abhinav Sharma, Lawrence O'Gorman, Michael Coss, Shubham Jain

发表机构 * Nokia Bell Labs(诺基亚贝尔实验室)

AI总结 针对实际应用中常见的少类数据集,提出基于数据属性的分类难度度量,实现比传统方法快6-29倍的模型选择,并扩展模型族至更小规模,在移动机器人等场景中提升效率。

Comments 36 pages, 9 tables, 13 figures

详情
AI中文摘要

尽管大量工作集中在开发和基准测试高性能神经网络上,但较少关注已知的数据集属性如何指导高效的模型选择。神经网络模型通常在数千类数据集上评估,然而许多实际应用涉及少于十类。为了解决这一被忽视但常见的情况,我们基于数据侧属性开发了一种分类难度度量,并展示了它如何为少类数据集实现更高效的模型选择,而传统方法在此效果较差。我们将此现象称为“少类独特性”。我们的度量允许比重复训练和测试快6到29倍的模型和数据集比较。利用这一洞察,我们将缩放模型族扩展到已发布的最小模型以下,在相似精度下实现更高效率,例如在移动机器人任务中模型比YOLOv5-nano小42%。针对资源受限的应用,我们在移动机器人、无人机和物联网场景中展示了少类模型选择,突出了在不牺牲性能的情况下效率的实际提升。

英文摘要

While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are typically evaluated on datasets with thousands of classes, yet many real-world applications involve fewer than ten. To address this understudied but common setting, we develop a measure of classification difficulty based on data-side properties and show how it enables more efficient model selection for few-class datasets, where traditional approaches are less effective. We term this phenomenon "few-class distinctiveness". Our metric allows comparison of models and datasets 6 to 29× faster than repeated training and testing. Leveraging this insight, we extend scaled model families below the smallest published models, achieving greater efficiency at similar accuracy, for example models up to 42% smaller than YOLOv5-nano for a mobile robot task. Targeting resource-constrained applications, we demonstrate few-class model selection across mobile robot, drone, and IoT scenarios, highlighting practical gains in efficiency without sacrificing performance.

2606.19711 2026-06-19 cs.RO cs.LG cs.SY eess.SY 新提交

A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data

一种可微复合近似框架:基于海试数据的自主水下航行器机动建模

Aobo Wang, Aifei Xia, Zihao Wang, Lizhu Hao

发表机构 * College of Shipbuilding Engineering, Harbin Engineering University(哈尔滨工程大学船舶工程学院) China Academy of Aerospace Aerodynamics(中国航天空气动力技术研究院) Institute of Artificial Intelligence, Shanghai University(上海大学人工智能研究院) China Ship Scientific Research Center(中国船舶科学研究中心)

AI总结 提出可微复合近似框架,结合多项式基与数据自适应基联合校准,并引入基于转向运动的海流估计与补偿,提升AUV机动预测精度。

详情
AI中文摘要

基于艇载测量的现场建模可以生成反映真实运行特性的自主水下航行器(AUV)机动模型。从近似角度看,传统机动模型使用预定义的约束多项式基,而数据驱动模型使用数据自适应基。受此基函数视角启发,本文提出一种可微复合近似公式,其中多项式基分量和数据自适应基分量被视为单个预测器的可微部分并联合校准。开发了一种基于梯度的协同校准方法用于全尺寸AUV机动预测,其中灵敏度感知机制调节有界多项式更新,而神经残差在共享预测目标下捕获剩余非线性差异。为了考虑现场数据中的海流效应,引入了一种基于转向运动的海流估计和补偿程序,以构建经海流补偿的学习目标,用于训练和滚动预测。该框架使用从7米长AUV在多种机动条件下收集的海试数据进行评估。结果表明,与纯多项式、纯神经网络和冻结先验混合基线相比,所提方法改进了递归轨迹和速度预测,证明了其在基于现场数据的AUV机动建模中的适用性。

英文摘要

Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.

2606.19710 2026-06-19 cs.CL cs.AI 新提交

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

FineREX: 面向人口走私知识图谱的微调NER-RE

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

发表机构 * Thomas Jefferson High School for Science and Technology(托马斯·杰斐逊科技高中)

AI总结 提出FineREX,一个基于微调LLM的流水线,用于从法律文档中提取实体和关系构建知识图谱,在F1分数上分别提升15.50%和31.46%,并减少50%处理时间。

Comments Code available at https://github.com/ElijahFeldman7/FineREX

详情
AI中文摘要

法庭记录包含关于人口走私网络的有价值证据,但这些信息通常埋藏在非结构化的、充满术语的法律文件中。虽然大型语言模型(LLM)可以通过自动信息提取支持知识图谱构建,但现有方法依赖通用模型,未针对该领域所需的实体和关系定义进行定制。我们提出FineREX,一个精简的知识图谱构建流水线,基于微调的LLM进行命名实体识别和关系提取(NER-RE)。使用包含512个文本块的手动标注数据集,FineREX在实体和关系F1分数上分别比更大的通用基线模型绝对提高了15.50%和31.46%。这些提升转化为更高质量的知识图谱,将法律噪声减少近一半,并将长文档上的节点重复率从17.78%降至11.17%。通过消除文档重写和冗余提取阶段,FineREX还将端到端处理时间减少了50.0%。我们的结果表明,领域特定的微调可以显著优于更大的通用模型,同时提高非法网络分析知识图谱构建的质量和效率。

英文摘要

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

2606.19706 2026-06-19 cs.CV cs.CL 新提交

NEST: Narrative Event Structures in Time for Long Video Understanding

NEST:面向长视频理解的时间叙事事件结构

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

发表机构 * Department of Computer Science, Virginia Tech(弗吉尼亚理工大学计算机科学系)

AI总结 提出NEST数据集(1005部全长电影),通过多模态叙事事件标注和关系链接,评估模型在长视频中理解事件结构、时间顺序和长程依赖的能力,实验表明事件检测等任务极具挑战性。

详情
AI中文摘要

视觉-语言模型的最新进展使得处理越来越长的视频序列成为可能,但处理扩展令牌流的能力并不能转化为对长视频中叙事结构的理解。现有的长视频基准侧重于大海捞针式检索,而不是评估低级动作如何形成事件、事件如何跨时间交互以及叙事如何进展,例如,模型是否能够将早期的挫折(如失业)与后来的关系破裂联系起来,尽管存在长时间间隔、中间场景或重新诠释事件的闪回。我们引入了NEST(面向长视频理解的时间叙事事件结构),一个包含1005部全长电影(平均98分钟)的数据集,每部电影都标注了102个基于视觉内容、对话和音频的多模态叙事事件。NEST通过基于视觉内容、对话和音频的结构化标注捕捉多模态叙事事件,并通过反映叙事结构的关系(包括时间顺序、层次组合和长程依赖)将它们联系起来。我们引入了事件触发检测(ETD)、事件定位(EL)、事件论元抽取(EAE)和事件关系抽取(ERE)的基线。该基准在有依据的(grounded)事件发现上极具挑战性,ETD低于8%,EL低于6%,EAE低于11%。相比之下,一旦事件给定,ERE更容易处理,零样本F1达到35.45%,微调后F1达到44.42%。

英文摘要

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress: for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.19700 2026-06-19 cs.CL 新提交

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

TerraMARS: 用于火星地球化改造文献的领域自适应小语言模型管道

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

发表机构 * University of Arizona(亚利桑那大学) College of Information Science, University of Arizona(亚利桑那大学信息科学学院) Biosphere 2, University of Arizona(亚利桑那大学生物圈2) Department of Ecology and Evolutionary Biology, University of Arizona(亚利桑那大学生态与进化生物学系) Department of Environmental Sciences, University of Arizona(亚利桑那大学环境科学系)

AI总结 提出TerraMARS管道,结合领域自适应小语言模型,从火星科学文献中提取结构化信息,支持地球化改造研究。

Comments 16 pages, 1 figure, 4 tables

详情
AI中文摘要

研究人员有兴趣了解火星,以便最终使其适合人类居住。为此,需要通过科学文献全面了解行星的大气、水文、表面化学、辐射环境和空间特征。这些文献包含有价值的信息和有意义的定量约束,可用于其他模型和研究,如宜居性评估和未来的地球化改造研究。我们提出了TerraMARS,一个端到端的信息提取管道,它结合了领域自适应的小语言模型来回答火星地球化改造相关问题,并将非结构化的火星科学文本转换为机器可读的结构化输出(JSON格式)。收集了一个开放获取论文语料库,并使用多阶段检索和分块框架进行处理。使用量化低秩自适应(QLoRA)对火星特定问答和信息提取数据集进行微调,使Google Gemma 3 1B适应领域。生成的管道产生两种类型的输出,并为将科学文献中的知识整合到下游应用(如数字孪生和火星宜居性建模)提供了基础。该管道的输出看起来很有前景,但需要进一步改进以提高提取准确性和事实一致性。

英文摘要

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features, drawn from the scientific literature. This literature contains valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that uses a domain-adapted Small Language Model to answer Mars terraforming-related questions and to convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications such as digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

2606.19699 2026-06-19 cs.RO cs.LG cs.SY eess.SY 新提交

Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes

具有主动脚趾的双足机器人敏捷性、效率和冲击吸收的比较研究

Joong-Gil Kim, Wontae Ye, Geunwoo Cho, Seong-Ho Yun, Se-Hyoung Cho, Yong-Jae Kim

发表机构 * School of Electrical, Electronics and Communication Engineering, Korea University of Technology and Education(韩国技术教育大学电气、电子与通信工程学院) Artificial Intelligence and Robotics Institute, Korea Institute of Science and Technology(韩国科学技术研究院人工智能与机器人研究所) Robot Innovation Hub, WIRobotics Inc.(WIRobotics公司机器人创新中心)

AI总结 提出一种14自由度双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性,通过高保真仿真训练环境,对比有无主动脚趾的配置,发现脚趾机器人以1.33米/秒行走时,CoT降低17.5%,脚跟冲击力降低5.0%,路径偏差平均和最大分别降低25.0%和34.0%。

Comments 6 pages, 7 figures

详情
AI中文摘要

人类腿部表现出高效率、敏捷性和冲击吸收能力,其中脚趾在这些能力中起着关键作用。尽管已经有许多尝试在机器人中实现类似人类的脚趾,但它们尚未完全复制人类特征,也没有严格验证其益处。我们提出了一种14自由度的双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性。为了定量分析主动脚趾在敏捷性、效率和冲击吸收方面的有效性,我们开发了一个高保真仿真训练环境,该环境反映了具有耦合传动和精确功耗的实际执行器。为了确保有和没有主动脚趾的配置之间的公平比较,我们设计了一个最小化强化学习奖励函数,并对两者应用了相同的训练程序。仿真结果表明,在1.33米/秒行走时,与无脚趾配置相比,配备脚趾的机器人将CoT降低了17.5%,脚跟冲击力降低了5.0%。在敏捷性测试中,平均和最大路径偏差分别降低了25.0%和34.0%。

英文摘要

Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have neither fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot that emulates the lightweight, high-torque, and robust nature of human toes. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal reinforcement learning (RL) reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced cost of transport (CoT) by 17.5% and heel-strike ground reaction force (GRF) by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.
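
For readers unfamiliar with the efficiency metric, the sketch below assumes that CoT denotes the standard dimensionless cost of transport, CoT = P / (m g v), with mean power P, mass m, gravitational acceleration g, and forward speed v. The mass and power values are hypothetical numbers chosen only so that the arithmetic lands on a 17.5% reduction; they are not taken from the paper, and the function names are illustrative.

```python
G = 9.81  # gravitational acceleration in m/s^2


def cost_of_transport(mean_power_w, mass_kg, speed_m_s, g=G):
    """Dimensionless cost of transport: power / (weight * speed)."""
    if mass_kg <= 0 or speed_m_s <= 0:
        raise ValueError("mass and speed must be positive")
    return mean_power_w / (mass_kg * g * speed_m_s)


def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Hypothetical robot: 30 kg, walking at 1.33 m/s (speed as in the abstract).
cot_without_toes = cost_of_transport(200.0, 30.0, 1.33)  # assumed 200 W
cot_with_toes = cost_of_transport(165.0, 30.0, 1.33)     # assumed 165 W
reduction = relative_reduction(cot_without_toes, cot_with_toes)
print(f"CoT reduction: {reduction:.1%}")
```

At equal mass and speed, a relative reduction in CoT is the same as the relative reduction in mean power, which is why comparing both configurations at the same walking speed makes the comparison straightforward.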

2606.19697 2026-06-19 cs.LG cs.AI cs.CL 新提交

Efficiently Representing Algorithms With Chain-of-Thought Transformers

用链式思维Transformer高效表示算法

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

发表机构 * Allen Institute for AI(艾伦人工智能研究所) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文证明链式思维Transformer能以多对数开销高效模拟Word RAM算法,包括排序和Dijkstra算法,优于模拟图灵机的二次开销。

详情
AI中文摘要

推理模型(即在产生答案前输出一系列推理或思维token的语言模型)日益流行,部分原因在于理论结果表明链式思维(CoT)Transformer可以模拟图灵机,从而执行任意计算。然而,图灵机虽然适用于复杂性理论分析,但在讨论算法时并不方便、直观或高效。算法通常在更高的抽象层次上设计和分析,即具有随机访问存储器和对$O(\log n)$位字的单位成本操作的Word RAM模型。因此,Word RAM算法可能比其图灵机对应物更高效,这引出了一个问题:CoT Transformer能否高效模拟Word RAM算法?例如,它们能否在$O(n \log n)$步内对n个元素排序,或在$O(E + V \log V)$步内运行Dijkstra算法?我们给出肯定回答,开销不超过多对数。我们首先为具有多对数宽度和最右唯一硬注意力的有限精度Transformer建立这一结果,然后将结果加强到两个具有有限宽度和对数精度的更实际设置:连续CoT(其中推理采用向量而非token形式)和混合架构(其中Transformer层位于循环(线性RNN)层之上)。在所有三种情况下,我们发现CoT可以高效模拟任何Word RAM算法,仅需关于n的多对数开销。当Word RAM具有“平坦”指令集时,此开销降至对数平方,而对于无乘法平坦指令仅需对数开销,这与已知的CoT模拟图灵机(相对于Word RAM需要二次开销)形成鲜明对比。

英文摘要

The increasing popularity of reasoning models (language models that output a series of reasoning or thought tokens before producing an answer) is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the Word RAM model with random-access memory and unit-cost operations on $O(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: can CoT transformers efficiently simulate Word RAM algorithms? For instance, can they sort $n$ items in $O(n \log n)$ steps or run Dijkstra's algorithm in $O(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: continuous CoT, where reasoning takes the form of vectors rather than tokens, and a hybrid architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a "flat" instruction set, and to only logarithmic for multiplication-free flat instructions, in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

2606.19688 2026-06-19 cs.SD eess.AS 新提交

Latency-Configurable Streaming Speech Enhancement via Asymmetric Temporal Padding

通过非对称时间填充实现延迟可配置的流式语音增强

Yunsik Kim, Yoonyoung Chung

发表机构 * Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH)(浦项科技大学电气工程系) Intus Co. Ltd.(Intus有限公司)

AI总结 提出LaCo-SENet,通过非对称时间填充和双缓冲流式机制,在单一超参数下实现延迟与质量的灵活权衡,在VoiceBank+DEMAND上以1.37M参数获得12.5-75.0ms延迟范围,PESQ从3.35到3.43。

Comments 5 pages, 3 figures. Accepted for presentation at Interspeech 2026

详情
AI中文摘要

流式语音增强需要在算法延迟和质量之间取得平衡,但现有方法大多将其视为因果与非因果的二元选择。LaCo-SENet通过单个训练时超参数参数化的两种机制解决了这个问题。首先,非对称时间填充重新分配卷积中的过去和未来上下文,实现系统性的延迟配置。其次,双缓冲流式结合了过去上下文的状态缓冲区和在输入和特征层面提供未来上下文的超前缓冲区。选择性状态更新还防止未来帧泄漏到流式状态中,确保训练-推理一致性。在VoiceBank+DEMAND上,固定预算(1.37M参数)的主干网络产生了覆盖12.5-75.0毫秒的模型系列,PESQ从3.35上升到3.43。在仅12.5毫秒(完全因果)时,PESQ为3.35,达到或超过了先前的因果最先进水平(46.5毫秒时为3.27)。

英文摘要

Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter. First, asymmetric temporal padding redistributes past and future context in convolutions, enabling systematic latency configuration. Second, dual-buffer streaming combines state buffers for past context with lookahead buffers that supply future context at both the input and feature levels. Selective state updates also prevent future-frame leakage into the streaming state, ensuring training-inference consistency. On VoiceBank+DEMAND, a fixed-budget (1.37M parameters) backbone yields a family of models spanning 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43. At just 12.5 ms (fully causal), a PESQ of 3.35 matches or exceeds the prior causal state-of-the-art (3.27 at 46.5 ms).
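
The idea of redistributing past and future context through padding can be illustrated on a single 1-D convolution. The sketch below is a generic illustration under stated assumptions (stride 1, no dilation, one layer), not the LaCo-SENet implementation; the function name `conv1d_asymmetric` and the parameter `future_frames` are hypothetical.

```python
import numpy as np


def conv1d_asymmetric(x, kernel, future_frames):
    """Length-preserving 1-D convolution with an asymmetric padding split.

    A kernel of size k needs k - 1 padded frames in total. Putting
    `future_frames` of them on the right lets each output look that many
    frames ahead; the remaining k - 1 - future_frames go on the left.
    future_frames = 0 gives a fully causal layer.
    """
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    if not 0 <= future_frames <= k - 1:
        raise ValueError("future_frames must lie in [0, k - 1]")
    past_frames = k - 1 - future_frames
    padded = np.pad(np.asarray(x, dtype=float), (past_frames, future_frames))
    # Cross-correlation, as in deep-learning "convolution" layers:
    # y[t] = sum_j kernel[j] * x[t + j - past_frames]
    return np.correlate(padded, kernel, mode="valid")


x = np.zeros(8)
x[4] = 1.0                      # single impulse at frame 4
kernel = [1.0, 2.0, 3.0]

y_causal = conv1d_asymmetric(x, kernel, future_frames=0)
y_lookahead = conv1d_asymmetric(x, kernel, future_frames=1)
print(np.flatnonzero(y_causal)[0], np.flatnonzero(y_lookahead)[0])
```

With `future_frames=0` the impulse first influences the output at its own frame, while with `future_frames=1` it already influences the previous output frame, which means that output cannot be emitted until one more input frame has arrived. In a stack of such layers the per-layer lookahead accumulates, which is what makes the padding split a latency knob.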

2606.19687 2026-06-19 cs.RO 新提交

Route-Constrained Robust Fusion Estimation for MEMS/GNSS Integrated Navigation of Unmanned Ground Vehicles in GNSS Degraded Environments

GNSS退化环境下无人地面车辆MEMS/GNSS组合导航的路径约束鲁棒融合估计

Jingzhi Cui, Chao Zhang, Yuliang Mao, Shaolin Lü, Dongmei Li, Huan Che, Rong Zhang

发表机构 * State Key Laboratory of Precision Space-time Information Sensing Technology, Tsinghua University(清华大学精密时空信息感知技术国家重点实验室) Xiaomi Inc.(小米公司)

AI总结 针对GNSS信号严重遮挡下结构化道路环境中无人地面车辆的累积定位漂移,提出一种鲁棒的路径约束状态估计方法,利用历史航位推算轨迹与高精地图匹配生成伪位置观测,通过扩展卡尔曼滤波持续注入道路级约束,抑制位置偏差并改善方位估计。

Comments Accepted workshop paper, 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, IEEE ICRA 2026

Journal ref 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, IEEE ICRA 2026, Vienna, Austria, June 5, 2026

详情
AI中文摘要

为了解决在严重全球导航卫星系统信号遮挡下结构化道路环境中无人地面车辆的累积定位漂移问题,本文提出了一种鲁棒的路径约束状态估计方法。在无卫星信号期间,该方法建立了历史航位推算轨迹与从高精地图中提取的任务路线局部段之间的对应关系,并通过二维刚性变换估计出路线参考位置。然后将估计的位置作为伪位置观测,纳入扩展卡尔曼滤波更新中。这样,道路级的路径约束可以持续注入到统一的状态估计框架中,从而抑制相对于任务路线的位置偏差,同时间接改善方位估计。为了增强实际适用性,进一步引入了触发控制、匹配质量验证、路径偏移补偿和单次更新修正限制等工程策略。在三个代表性场景(长隧道、多段隧道和弯曲隧道)中的实验表明,所提方法有效抑制了卫星中断期间的误差累积,降低了最大偏差过大的风险,并提高了定位连续性和道路级可用性。

英文摘要

To address cumulative localization drift of unmanned ground vehicles in structured road environments under severe Global Navigation Satellite System signal occlusion, this paper proposes a robust route-constrained state estimation method. During periods without satellite signals, the proposed method establishes the correspondence between the historical dead reckoning trajectory and local segments of the mission route extracted from a high-definition map, and estimates a route-referenced position via a two-dimensional rigid transformation. The estimated position is then formulated as a pseudo-position observation and incorporated into an Extended Kalman Filter update. In this way, route constraints at the road level can be continuously injected into a unified state estimation framework, thereby suppressing position deviation relative to the mission route while indirectly improving azimuth estimation. To enhance practical applicability, engineering strategies, such as trigger control, matching quality validation, route offset compensation, and single update correction limiting, are further introduced. Experiments in three representative scenarios, including a long tunnel, a multi-segment tunnel, and a curved tunnel, show that the proposed method effectively suppresses error accumulation during satellite outages, reduces the risk of large maximum deviation, and improves localization continuity and road-level usability.
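
Treating the route-referenced position as a pseudo-position observation amounts to a standard Kalman measurement update with a position-only measurement matrix. The sketch below assumes a hypothetical state layout `[px, py, vx, vy]` and a linear measurement model; it is not the paper's filter and omits the trigger control, matching validation, offset compensation, and correction limiting the abstract mentions.

```python
import numpy as np


def pseudo_position_update(x, P, z, R):
    """Kalman measurement update with a 2-D pseudo-position observation.

    x : state mean, assumed layout [px, py, vx, vy, ...]
    P : state covariance
    z : route-referenced position (2-vector) from trajectory-to-map matching
    R : 2x2 covariance expressing how much the matched position is trusted
    """
    n = x.shape[0]
    H = np.zeros((2, n))
    H[0, 0] = 1.0               # observe px
    H[1, 1] = 1.0               # observe py
    innovation = z - H @ x
    S = H @ P @ H.T + R         # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ innovation
    # Joseph form keeps the covariance symmetric and positive semidefinite.
    I_KH = np.eye(n) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T
    return x_new, P_new


x = np.array([0.0, 0.0, 1.0, 0.0])      # drifted dead-reckoning state
P = np.diag([4.0, 4.0, 1.0, 1.0])
z = np.array([2.0, 0.0])                # hypothetical matched route position
R = 4.0 * np.eye(2)
x_new, P_new = pseudo_position_update(x, P, z, R)
```

With equal prior and measurement variances the position estimate moves halfway toward the pseudo-observation. In a full navigation filter the covariance carries cross-terms between position, velocity, and heading errors, which is how a position-only update can also improve azimuth indirectly.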

2606.19684 2026-06-19 cs.CV New submission

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le

Affiliations University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh

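AI summary Proposes a framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and adopts a two-stage fine-tuning strategy to enhance contrastive learning, addressing scarce annotated data and simplistic negative sampling in fashion image retrieval.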

Comments SOICT 2025

Abstract

Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.
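
To show why the number of negatives matters in contrastive learning, here is a minimal sketch of an InfoNCE-style loss. The abstract does not state the exact loss, so InfoNCE is an assumption. The function `info_nce`, the toy vectors, and the idea that negatives are a list of precomputed ("static") embeddings are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(query, positive, negatives, temperature=0.07):
    """Cross-entropy of picking the positive among positive + negatives (assumed loss)."""
    q = normalize(query)
    # Cosine similarities scaled by temperature; the positive sits at index 0.
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy embeddings: the composed query matches the target exactly.
query = [1.0, 0.0]
target = [1.0, 0.0]
few_negatives = [[0.0, 1.0]]
more_negatives = [[0.0, 1.0], [0.0, -1.0]]   # e.g. extra precomputed embeddings
loss_few = info_nce(query, target, few_negatives, temperature=1.0)
loss_more = info_nce(query, target, more_negatives, temperature=1.0)
```

With the same query and target, adding negatives raises the loss, so the model must separate the target from a larger pool of candidates.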

2606.19683 2026-06-19 cs.AI cs.MA cs.SY eess.SY New submission

Exit-and-Join Dynamics for Decentralized Coalition Formation

Quanyan Zhu

Affiliations New York University Tandon School of Engineering, Department of Electrical and Computer Engineering

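AI summary Studies decentralized coalition formation dynamics driven by unilateral exit-and-join decisions, using the Aumann-Dreze value to compute individual payoffs, linking cooperative payoff allocation with noncooperative best-response behavior, and analyzing equilibrium characterizations and the effect of costs on local stability.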

Abstract

This paper studies coalition formation as a decentralized dynamical process driven by unilateral exit-and-join decisions. Agents evaluate local moves using the Aumann-Dreze value, so payoffs are computed within the agent's current coalition rather than through a globally negotiated coalition structure. The resulting model links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. We establish equilibrium characterizations, identify conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyze how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.
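
The exit-and-join process can be sketched on a toy game. This sketch makes several assumptions: the Aumann-Dreze payoff is computed as the Shapley value of the game restricted to the agent's coalition, a move is taken only when it is strictly profitable after a flat `switch_cost`, and acceptance by the receiving coalition is ignored. The names `shapley_within`, `exit_and_join`, and the characteristic function `v` are hypothetical. `max_rounds` guards against cycling, since termination is not guaranteed for arbitrary games.

```python
from itertools import permutations

def shapley_within(coalition, v):
    """Shapley value of v restricted to `coalition` (the Aumann-Dreze payoff)."""
    members = sorted(coalition)
    phi = {i: 0.0 for i in members}
    orders = list(permutations(members))
    for order in orders:
        seen = frozenset()
        for i in order:
            phi[i] += v(seen | {i}) - v(seen)   # marginal contribution of i
            seen = seen | {i}
    return {i: phi[i] / len(orders) for i in members}

def exit_and_join(partition, v, switch_cost=0.0, max_rounds=100):
    """Apply strictly profitable unilateral moves until none remains."""
    partition = [frozenset(c) for c in partition]
    for _ in range(max_rounds):
        moved = False
        agents = sorted(a for c in partition for a in c)
        for i in agents:
            home = next(c for c in partition if i in c)
            current = shapley_within(home, v)[i]
            # Candidate destinations: any other coalition, or going solo.
            targets = [c for c in partition if c != home]
            if len(home) > 1:
                targets.append(frozenset())
            best_gain, best_target = 0.0, None
            for t in targets:
                gain = shapley_within(t | {i}, v)[i] - switch_cost - current
                if gain > best_gain + 1e-12:
                    best_gain, best_target = gain, t
            if best_target is not None:
                partition = [c for c in partition if c != home and c != best_target]
                if home - {i}:
                    partition.append(home - {i})
                partition.append(best_target | {i})
                moved = True
        if not moved:
            break   # terminal partition: no profitable deviation left
    return sorted(partition, key=lambda c: sorted(c))

# Toy convex game (an illustrative choice, not the paper's benchmark).
v = lambda S: float(len(S) ** 2)
merged = exit_and_join([{1}, {2}, {3}], v)
frozen = exit_and_join([{1}, {2}, {3}], v, switch_cost=5.0)
```

In this toy game every agent gains by joining a larger coalition, so the dynamics end in the grand coalition. A switching cost of 5 exceeds every possible gain, so the singleton partition is already terminal.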