arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2606.12105 2026-06-11 cs.RO cs.CV cs.LG 新提交

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

DAM-VLA: 解耦异步多模态视觉语言动作模型

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

发表机构 * Intuitive Robots Lab, Karlsruhe Institute of Technology (KIT)（直觉机器人实验室，卡尔斯鲁厄理工学院）； NVIDIA（英伟达）； Robotics Institute of Germany（德国机器人研究所）

AI总结针对VLA模型同步时钟与物理交互中不同模态频率不匹配的问题，提出DAM-VLA，通过解耦各模态时间处理、维护传感器速率更新的潜在缓冲区，并利用门控交叉注意力整合高频模态，在7个真实操作任务中平均成功率提升至95.2%。

Comments 17 pages, 8 figures

详情

AI中文摘要

视觉-语言-动作（VLA）模型继承了视觉-语言预训练中的共享同步时钟，以单一速率处理每个输入。这与物理交互不一致，在物理交互中，高频模态以数百赫兹变化，视觉演化较慢，而语言在整个回合中保持不变。同步VLA会过采样慢速模态，欠采样快速模态，并将动作生成限制在最低有效频率。我们假设解耦每个模态的时间处理，让每个模态以其自身传感器速率更新和保留信息，可以产生更强的表示和更鲁棒的控制。我们提出DAM-VLA，它维护每个模态的潜在缓冲区，以传感器速率刷新并由动作头连续读取，通过门控交叉注意力整合新的高频模态，同时保持预训练主干不变。在七个接触丰富的真实世界操作任务中，DAM-VLA将最强同步基线的平均成功率提高了一倍以上（95.2% vs. 40.95%），同时维持平滑、反应式的100 Hz控制。项目网站：\href{ this https URL }{ this http URL }

英文摘要

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2\% vs.\ 40.95\%) while sustaining smooth, reactive 100\,Hz control. Project website: \href{https://intuitive-robots.github.io/DAM-VLA/}{intuitive-robots.github.io/DAM-VLA/}

URL PDF HTML ☆

赞 0 踩 0

2606.12099 2026-06-11 cs.CV 新提交

ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation

ISAP-3D: 身份槽对齐的部件感知3D生成

Junlin Hao, Haoshuai Fu, Xibin Song, Wei Li, Ruigang Yang, Xinggong Zhang, Jinchuan Zhang

发表机构 * Peking University（北京大学）； Tencent（腾讯）； Huawei（华为）； University of Science and Technology of China（中国科学技术大学）

AI总结针对部件感知3D生成中因身份-布局纠缠导致的结构歧义问题，提出身份槽对齐框架ISAP-3D，通过语义身份令牌锚定每个部件并进行一对一布局预测，实现稳定可控的部件级3D生成。

详情

AI中文摘要

部件感知3D生成旨在合成具有语义意义组件的结构化对象，但由于身份-布局纠缠，常常遭受结构歧义。现有方法要么隐式推断部件身份和空间布局，导致不稳定的部件分配（例如槽交换或部件合并），要么依赖在实践中难以获得的强布局条件。我们将这种歧义归因于身份槽置换自由度：没有显式的身份槽对齐，训练期间语义部件和生成槽之间的对应关系不可识别，允许多个槽分配适应相同的监督，导致不一致的分解。基于这一见解，我们认为稳定的部件感知生成需要身份对齐的一对一槽建模。因此，我们提出了一个身份槽对齐框架ISAP-3D，该框架用语义身份令牌锚定每个部件，执行身份条件的一对一布局预测，随后进行布局条件的几何合成。结构化的局部-全局条件在语义、空间和几何阶段保持身份对齐。我们还构建了一个具有统一语义协议的部件级数据集，以实现可学习且一致的身份槽对齐。大量实验表明，与最先进的部件感知生成基线相比，我们的方法在结构稳定性、可控性和鲁棒性方面有所改进。

英文摘要

Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.12088 2026-06-11 cs.CL 新提交

Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

无保护属性的去偏：从文本画像中消除潜在概念

Shun Shao, Zheng Zhao, Anna Korhonen, Yftah Ziser, Shay B. Cohen

发表机构 * University of Cambridge（剑桥大学）； University of Edinburgh（爱丁堡大学）； University of Groningen（格罗宁根大学）； NVIDIA Research（英伟达研究院）

AI总结提出H-SAL方法，利用自我描述文本作为隐式信号进行后处理概念和属性消除，在无直接敏感属性下实现去偏，并在多领域Stack Exchange基准上验证其效果与显式标签去偏相当或更优。

Comments 23 pages, 5 figures, 12 tables. The paper is currently under review

详情

AI中文摘要

大多数自然语言处理中的公平性研究假设可以直接访问性别、种族或国籍等保护属性。然而，在实践中，由于隐私限制、元数据缺失或法律约束，这些信息通常不可用，尽管模型可能从间接文本线索中推断出来。这引发了一个关键问题：在没有直接访问敏感属性的情况下，去偏能否成功？我们提出了H-SAL，它利用自我描述文本作为隐式去偏信号，执行事后概念和属性消除。为了支持这一设置，我们引入了一个基于Stack Exchange的多领域公平性基准，用于帮助度预测，该基准包括显式和隐式信号，从而能够在有保护标签的标准去偏和无敏感信息访问的去偏之间进行比较。在编码器和仅解码器语言模型中，我们发现隐式自我描述通常匹配或优于基于显式标签的去偏。我们的结果拓宽了表示层面的公平性研究，并为在现实数据约束下研究去偏提供了新的基准。

英文摘要

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.

URL PDF HTML ☆

赞 0 踩 0

2606.12087 2026-06-11 cs.CL 新提交

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

FORT-Searcher：合成抗捷径搜索任务以训练深度搜索智能体

Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng, Shuo Tang, Wayne Xin Zhao, Feng Chang, Chuan Hao, Yuan Wei, Ran Tao, Bryan Dai, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence Renmin University of China（中国人民大学高瓴人工智能学院）； KAUST（阿卜杜拉国王科技大学）； IQuest Research（IQuest研究院）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出FORT框架，通过控制四种捷径风险合成抗捷径训练数据，使搜索智能体进行更长的预答案搜索，减少捷径模式，仅用SFT训练即达到最优性能。

Comments 30 pages

详情

AI中文摘要

训练深度搜索智能体需要可验证的问题，其答案只有在通过搜索获得足够证据后才可用。现有的合成方法通常通过丰富图结构来增加表面难度，但仅凭结构复杂性并不能保证实现实际的搜索难度：预期的搜索过程可能通过更便宜的识别路径崩溃。我们用一个捷径感知的难度框架形式化了这一差距，并识别了四种可操作的捷径风险：证据共覆盖、单线索选择性、暴露常数和先验知识绑定。为了诊断它们的实际效果，我们使用轨迹签名，包括求解成本、答案命中时间和先验捷径率。在此框架的指导下，我们引入了FORT，一个抗捷径训练数据合成框架。FORT通过控制实体选择、证据图构建、问题表述和对抗性细化中的捷径风险来构建抗捷径训练数据。实验表明，与现有的开源深度搜索数据集相比，FORT诱导了更长的预答案搜索和更少的捷径模式。使用由此产生的轨迹，我们仅通过监督微调（SFT）训练FORT-Searcher，并在具有挑战性的深度搜索基准上取得了可比大小的开源搜索智能体中最佳的整体性能。相关资源将在https://this URL上提供。

英文摘要

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at https://github.com/RUCAIBox/FORT-Searcher.

URL PDF HTML ☆

赞 0 踩 0

2606.12086 2026-06-11 cs.AI cs.LG 新提交

IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

IntElicit: 通过对话策略优化引出和评估情境化创造力

Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * East China Normal University（华东师范大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出IntElicit框架，通过分解过程奖励机制优化对话策略，在交互中减少非创造性混淆因素，从而更有效地引出和评估情境化创造力。

详情

AI中文摘要

情境化评估为评估创造力提供了高生态效度，但也引入了一个关键挑战：观察到的表现可能与认知熟练度（领域知识）和能动性（参与意愿）相混淆。同时，在生成式AI时代，创造性问题解决越来越多地发生在工具中介和人机交互环境中，使得完全静态的评估与当代创造性实践不太一致。为了解决这些问题，本文提出了IntElicit，一个通过对话策略优化来引出和评估情境化创造力的框架。IntElicit作为一个受约束的自适应AI面试官：它在多轮交互中提供非指导性的知识和能动性支架，以减少非创造性混淆因素，同时保留参与者生成被评估的创造性内容的责任。具体来说，为了解决开放教育对话中的稀疏奖励和潜在奖励破解（例如，答案听写），IntElicit引入了一种分解过程奖励机制。该机制将策略与教学引出对齐，奖励那些引出参与者推理而非代表他们产生最优答案的提示。大量实验，包括参与者模拟和一项人类受试者研究（N=64），表明IntElicit比专家设计的基线提高了引出的创造性成果。总之，结果表明，交互式引出可以揭示静态FPSP式评估可能遗漏的创造性潜力，为AI中介学习环境中的情境化创造力评估提供了形成性和诊断性视角。

英文摘要

Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human--AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants' responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.

URL PDF HTML ☆

赞 0 踩 0

2606.12077 2026-06-11 cs.LG 新提交

Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

基于多尺度储层动力学与粒球锚定图优化的高效时间序列聚类

Yifan Wang, Lifeng Shen, Shuyin Xia, Yi Wang

发表机构 * Chongqing Key Laboratory of Computational Intelligence, Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Sichuan-Chongqing Co-construction Key Laboratory of Digital Economy Intelligence and Key Laboratory of Big Data Intelligent Computing, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications（重庆邮电大学计算机科学与技术学院，计算智能重庆市重点实验室，网络空间大数据智能安全教育部重点实验室，川渝共建数字经济智能重点实验室，大数据智能计算重点实验室）； Chongqing Ant Consumer Finance Co,. Ltd , Ant Group（蚂蚁集团，重庆蚂蚁消费金融有限公司）

AI总结提出MSRGC-Net框架，结合无训练储层计算、粒球锚定图构建和共识学习，实现高效且准确的时间序列聚类。

Comments Accepted by IJCAI 2026

详情

AI中文摘要

时间序列聚类由于聚类效果与计算效率之间的固有权衡仍然具有挑战性。基于相似性的方法通常因成对距离计算而面临二次复杂度，而基于深度学习的方法通常依赖于昂贵的迭代训练和大量可训练参数。在本文中，我们提出了MSRGC-Net，一种高效的时间序列聚类框架，它集成了多尺度储层计算、基于粒球的锚定图构建和共识学习。MSRGC-Net采用无训练的储层计算范式，从原始时间序列中提取多尺度时间表示，无需反向传播，显著降低了计算开销。为了捕捉所得表示的内在结构，采用粒球计算通过密度一致区域自适应地建模数据分布，生成紧凑且鲁棒的锚定图表示。此外，引入了一种基于共识的锚定图优化策略，以有效对齐多尺度储层表示并整合跨时间尺度的互补信息。在广泛使用的单变量和多变量基准数据集上的大量实验表明，MSRGC-Net在聚类性能上持续优于最先进的方法，同时保持卓越的计算效率。

英文摘要

Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance computations, while deep learning-based approaches typically rely on costly iterative training and a large number of trainable parameters. In this paper, we propose MSRGC-Net, an efficient time-series clustering framework that integrates multiscale reservoir computing, granular-ball-based anchoring graph construction, and consensus learning. MSRGC-Net adopts a training-free reservoir computing paradigm to extract multiscale temporal representations from raw time series without backpropagation, significantly reducing computational overhead. To capture the intrinsic structure of the resulting representations, granular-ball computing is employed to adaptively model data distributions via density-consistent regions, yielding compact and robust anchor graph representations. Furthermore, a consensus-based anchoring graph optimization strategy is introduced to effectively align multiscale reservoir representations and integrate complementary information across temporal scales. Extensive experiments on widely used univariate and multivariate benchmark datasets demonstrate that MSRGC-Net consistently outperforms state-of-the-art methods in clustering performance while maintaining superior computational efficiency.

URL PDF HTML ☆

赞 0 踩 0

2606.12074 2026-06-11 cs.CV cs.AI eess.IV 新提交

Non-frontal face recognition using GANs and memristor-based classifiers

基于GAN和忆阻器分类器的非正面人脸识别

Semih Vazgecen, Cristian Sestito, Spyros Stathopoulos, Themis Prodromakis

发表机构 * Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh（爱丁堡大学工程学院集成微纳系统研究所电子前沿中心）

AI总结提出将轻量级GAN正面化与忆阻器神经形态识别结合，解决非正面人脸识别，在数据集上达96%准确率。

Comments 12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)

详情

AI中文摘要

人脸识别系统通过深度学习技术取得了显著进展，在复杂场景中实现了高性能和鲁棒性。然而，这些方法带来了巨大的计算开销，限制了它们在资源受限平台（如无人机）上的原位适用性，而这些平台需要应对非正面人脸图像等挑战。基于忆阻器的神经形态系统已成为边缘AI应用的一种引人注目的方法，它将生物启发式处理与高效可扩展的计算相结合。在这项工作中，我们提出了一种人脸识别框架，通过集成基于轻量级生成对抗网络（GAN）的正面化处理和基于忆阻器的神经形态识别，来解决非正面姿态变化问题。在两个数据集上的实验结果表明，将对抗学习与忆阻技术相结合的有效性，实现了高达96%的识别准确率。所提出的方法缓解了传统AI的计算瓶颈，并为动态真实环境中的人脸识别提供了一种可扩展、高效的解决方案。

英文摘要

Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.

URL PDF HTML ☆

赞 0 踩 0

2606.12072 2026-06-11 cs.CV 新提交

World Model Self-Distillation: Training World Models to Solve General Tasks

世界模型自蒸馏：训练世界模型以解决通用任务

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

发表机构 * Department of Computer Science（计算机科学系）

AI总结提出结合自蒸馏与强化学习的框架，从预训练视频生成器中提取任务解决能力，无需配对任务视频，在基准测试中超越原始模型。

详情

AI中文摘要

预训练视频生成器是有前景的视觉世界模型，展现出涌现的任务解决能力；然而，它们对详细文本描述的依赖限制了其在规划和决策中的直接使用。现有方法要么将这种推理外包给语言或视觉-语言模型，要么依赖带有配对任务执行视频的监督微调，后者收集成本高且难以扩展。我们提出一个可扩展的框架，通过结合自蒸馏与强化学习来激发此类模型的任务解决能力。给定一张无标注场景图像，视觉-语言模型生成候选任务和详细的逐步解决方案。该解决方案条件化一个预训练视频扩散模型（演示者）；我们将其行为蒸馏到一个仅以图像和简短任务提示为条件的执行者中。这将执行知识从字幕引导生成转移到指令条件任务解决，无需精心策划的任务视频监督。我们进一步通过来自VLM反馈的强化学习改进执行者，利用判断采样视频是否满足任务与生成解决方案之间的不对称性。在我们提出的WorldTasks-Benchmark和DreamGen机器人基准上的实验表明，在我们基于VLM的评估协议下，执行者超越了演示者，并具有竞争力地迁移到机器人任务。

英文摘要

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor conditioned only on the image and a short task prompt. This transfers execution knowledge from caption-guided generation to instruction-conditioned task solving without curated task-video supervision. We further improve the Executor with reinforcement learning from VLM feedback, exploiting the asymmetry between judging whether a sampled video satisfies a task and generating the solution. Experiments on our proposed WorldTasks-Benchmark and the DreamGen robotics benchmark show that the Executor surpasses the Demonstrator under our VLM-based evaluation protocol and transfers competitively to robotic tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.12070 2026-06-11 cs.RO 新提交

Fibration Trees: A Unified Approach to Multi-Robot Motion Planning

纤维树：多机器人运动规划的统一方法

Andreas Orthey, Florian T. Pokorny, Lydia E. Kavraki

发表机构 * Technical University of Berlin（柏林工业大学）； KTH Royal Institute of Technology（瑞典皇家理工学院）； Rice University and the Ken Kennedy Institute（莱斯大学和肯·肯尼迪研究所）

AI总结提出纤维树统一框架，通过纤维化建模投影，结合优先序、并行分解和任务空间投影，并开发Fibration-RRT规划器，在高维多机器人运动规划中实现概率完备性。

Comments 23 pages, 12 figures

详情

AI中文摘要

状态空间投影与分解已成为解决高维多机器人运动规划问题中维度灾难的强大工具。然而，现有方法缺乏一个统一框架来无缝处理投影（优先序或任务空间）与分解（并行或解耦子空间）的组合。为填补这一空白，我们引入了纤维树，即以状态空间为节点、纤维化为边的树结构，其中纤维化将高维空间投影到低维（或简化）空间。通过将投影建模为纤维化，我们将顺序优先序、并行分解和任务空间投影统一在单一、连贯的形式体系下。在此基础上，我们开发了快速探索随机纤维树（Fibration-RRT）规划器，这是一种基于采样的运动规划器，它推广了商空间RRT（用于顺序优先序）和离散RRT（用于并行分解）的策略，同时允许包含任务空间投影。Fibration-RRT在用户定义的纤维树上运行，并被证明是概率完备的。为测试Fibration-RRT的通用性和效率，我们提供了开源实现，并在32个场景中进行了实验，使用了多达96自由度的多机器人团队。结果表明，Fibration-RRT通过利用用户定义的纤维树高效解决了高维问题，从而确立了纤维树作为多机器人运动规划的强大统一框架。

英文摘要

State space projections and decompositions have emerged as powerful tools to tackle the curse of dimensionality in high-dimensional, multi-robot motion planning problems. However, existing methods lack a unified framework which seamlessly handles combinations of projections (prioritization or task-space) and decompositions (parallel or decoupled subspaces). To fill this gap, we introduce fibration trees, which are trees consisting of state spaces as nodes and fibrations as edges, whereby a fibration models a projection from a higher-dimensional space to a lower-dimensional (or simplified) space. By modeling projections as fibrations, we unify sequential prioritization, parallel decomposition, and task-space projections under a single, coherent formalism. Building on this, we develop the rapidly-exploring random fibration trees (Fibration-RRT) planner, a sampling-based motion planner that generalizes strategies from quotient-space RRT (for sequential prioritizations) and discrete RRT (for parallel decompositions), while allowing the inclusion of task-space projections. Fibration-RRT operates on user-defined fibration trees and is proven to be probabilistically complete. To test the generality and efficiency of Fibration-RRT, we provide an open-source implementation and conduct experiments on 32 scenarios using multi robot teams with up to 96 degrees of freedom. Our results indicate that Fibration-RRT efficiently solves high-dimensional problems by exploiting user-defined fibration trees, thereby establishing fibration trees as a powerful, unified framework for multi-robot motion planning.

URL PDF HTML ☆

赞 0 踩 0

2606.12069 2026-06-11 cs.CV 新提交

Tac-DINO: Learning Vision-Tactile Features with Patch Alignment

Tac-DINO：基于补丁对齐的视觉-触觉特征学习

Hong Li, Yankang Dong, Yue Xu, Yihan Tang, Mingzhu Li, Jiamin Qiu, Qihang Yao, Xing Zhu, Yujun Shen, Nan Xue, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Ant Group（蚂蚁集团）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出Tac-DINO方法，通过构建大规模触觉数据集和视觉-触觉全息匹配基准，利用补丁对齐学习局部到全局的视觉-触觉表征，性能优于无对齐方法。

2606.12068 2026-06-11 cs.CL 新提交

StanceNakba Shared Task: Actor and Topic-Aware Stance Detection in Public Discourse

StanceNakba 共享任务：公共话语中基于行动者和主题的立场检测

Kholoud K. Aldous, Md Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, Kais Attia, Wajdi Zaghouani

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）； Hamad Bin Khalifa University（哈利法大学）

AI总结提出 StanceNakba 2026 共享任务，通过两个子任务（行动者级和跨主题立场检测）利用微调 Transformer 模型（如 MARBERT、AraBERT）在巴以冲突相关社交媒体数据上实现高 Macro F1 分数。

Comments 11 Pages, 6 Tables

详情

Journal ref: Proceedings of the 2nd International Workshop on Nakba Narratives as Language Resources (Nakba-NLP 2026), LREC-COLING 2026, pp. 80-90, ELRA Language Resources Association, 2026

AI中文摘要

我们提出 StanceNakba 2026，这是一个关于巴以冲突相关极化社交媒体话语中立场检测的共享任务，作为 LREC-COLING 2026 上 Nakba-NLP 2026 的一部分组织。该任务引入两个子任务：子任务 A（行动者级立场检测），将英语社交媒体帖子分类为亲巴勒斯坦、亲以色列或中立；子任务 B（跨主题立场检测），识别阿拉伯语帖子中关于两个冲突相关主题（与以色列正常化以及约旦难民存在）的赞成、反对或中立立场。该任务基于一个包含 2,606 条社交媒体帖子的标注数据集。共有 7 个团队参加了子任务 A，6 个团队参加了子任务 B。参与系统主要微调了阿拉伯语和多语言基于 Transformer 的模型，包括 MARBERT、AraBERT 和 DeBERTa-v3 变体，多个团队采用了交叉验证、集成方法和主题条件架构。表现最佳的系统在子任务 A 上达到了 0.9620 的 Macro F1，在子任务 B 上达到了 0.8724，表明基于 Transformer 的方法对于冲突领域立场检测非常有效，同时突显了跨主题泛化和中立类别预测方面的持续挑战。

英文摘要

We present StanceNakba 2026, a shared task on stance detection in polarized social media discourse related to the Palestinian-Israeli conflict, organized as part of Nakba-NLP 2026 at LREC-COLING 2026. The task introduces two subtasks: Subtask A (Actor-Level Stance Detection), which classifies English social media posts as Pro-Palestine, Pro-Israel, or Neutral; and Subtask B (Cross-Topic Stance Detection), which identifies Favor, Against, or Neither stances in Arabic posts toward two conflict-related topics, normalization with Israel and refugee presence in Jordan. The task is grounded in an annotated dataset of 2,606 social media posts. A total of 7 teams participated in Subtask A and 6 teams in Subtask B. Participating systems primarily fine-tuned Arabic and multilingual transformer-based models, including MARBERT, AraBERT, and DeBERTa-v3 variants, with several teams employing cross-validation, ensemble methods, and topic-conditioned architectures. The best-performing systems achieved a Macro F1 of 0.9620 on Subtask A and 0.8724 on Subtask B, demonstrating that transformer-based approaches are highly effective for conflict-domain stance detection while highlighting persistent challenges in cross-topic generalization and neutral class prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.12066 2026-06-11 cs.CV 新提交

Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

YOLOv11与YOLOv8在发展中国家恶劣天气下混合交通目标检测的性能分析

Quoc Thuan Nguyen, Ha Anh Vu, Ngo Dang Thanh Ngan, Minh Phuc Hoang Ngoc

发表机构 * FPT University（FPT大学）

AI总结针对发展中国家恶劣天气下的混合交通场景，评估YOLOv11n与YOLOv8n在融合数据集上的性能，YOLOv11n在精度提升3.2%的同时计算量减少22%，实现精度与效率的优化平衡。

详情

AI中文摘要

在现代车辆系统中，恶劣条件下的鲁棒性能已成为自动驾驶的关键问题。我们的研究对YOLO系列最新版本YOLOv11 Nano架构进行了全面评估，以广泛采用的YOLOv8 Nano为基线，在融合了印度驾驶数据集（IDD）[1]和伯克利深度驾驶数据集（BDD100K）[2]的自定义数据集上进行基准测试。我们分析了在涉及密集混合交通、雨天和低光照条件的高熵场景中检测精度、推理速度和计算效率之间的权衡。具体而言，YOLOv11n实现了46.6%的平均精度（mAP@50），精度比基线提高了3.2%，有效减少了杂乱场景中的误报。此外，该模型表现出更高的能效，FLOPs减少22%（6.3G vs. 8.1G），同时在Tesla T4 GPU上保持70.9 FPS的实时推理速度，为安全关键的边缘部署提供了最优权衡。

英文摘要

In modern vehicular systems, robust performance under harsh conditions has become a critical problem of autonomous driving. Our study delivers a comprehensive evaluation of the newest iteration of the YOLO series, which is YOLOv11 Nano architecture benchmarked against the widely adopted YOLOv8 Nano as a baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and Berkeley Deep Drive Dataset (BDD100K) [2]. We have analyzed the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, the proposed model exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.12065 2026-06-11 cs.AI cs.MA 新提交

Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

BIM中几何密集型合规检查自动化：基于图的语义推理框架

Zixuan Xiao, Pei Troh Koh, Jun Ma, Jack C. P. Cheng

发表机构 * Department of Urban Planning and Design, The University of Hong Kong（香港大学城市规划与设计系）； Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology（香港科学与技术大学土木与环境工程系）

AI总结针对BIM中几何密集型法规自动检查的语义鸿沟问题，提出SGR-BIM图驱动推理框架，通过跨模态知识图谱实现可解释推理，在679个消防规范查询上达到84.3%准确率，较基线提升8.6%。

详情

DOI: 10.1016/j.autcon.2026.107038
Journal ref: Automation in Construction 189 (2026) 107038

AI中文摘要

自动化几何密集型法规的合规检查仍然是建筑信息模型（BIM）中的一个重大技术瓶颈，主要原因是高层级法规逻辑与结构化IFC数据之间的语义差异。现有方法通常依赖于静态规则模板，难以遍历多跳推理链或解决跨多个建筑实体的潜在空间依赖关系。为应对这些挑战，提出了一种面向建筑信息模型的空间几何推理系统（SGR-BIM），作为一个集成的图驱动推理框架。SGR-BIM动态构建跨模态知识图谱，对齐用户意图、法规语义和BIM几何，无需硬编码即可实现可解释推理。在来自消防规范的679个专家验证查询上验证，该框架达到了84.3%的准确率，比增强工具的单智能体基线提高了8.6%。本研究提供了一种基于图的语义推理范式，增强了建筑、工程和施工（AEC）行业中自动化几何合规检查工作流的透明度和灵活性。

英文摘要

Automating compliance check for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling (BIM), primarily due to the semantic disparity between high-level regulatory logic and structured IFC data. Existing methods, often reliant on static rule templates, struggle to traverse multi-hop reasoning chains or resolve latent spatial dependencies across multiple building entities. To address these challenges, a Spatial-Geometric Reasoning System for Building Information Modeling (SGR-BIM) is proposed as an integrative graph-driven reasoning framework. SGR-BIM dynamically constructs a cross-modal knowledge graph that aligns user intent, regulatory semantics, and BIM geometry, enabling interpretable reasoning without rigid hard-coding. Validated on 679 expert-verified queries from fire safety codes, the framework achieves 84.3% accuracy, representing an 8.6% improvement over enhanced-tool single-agent baselines. This research provides a graph-based semantic reasoning paradigm, enhancing the transparency and flexibility of automated geometric compliance check workflows in the Architecture, Engineering, and Construction (AEC) industry.

URL PDF HTML ☆

赞 0 踩 0

2606.12059 2026-06-11 cs.LG cs.NE nlin.AO 新提交

Attention by Synchronization in Coupled Oscillator Networks

耦合振荡器网络中的同步注意力机制

Fabio Pasqualetti, Taosha Guo

发表机构 * University of California, Irvine（加州大学尔湾分校）

AI总结提出基于Kuramoto同步动力学的固定查询振荡器注意力机制，无需指数运算和全局归约，在物理基板上实现注意力计算，并在关键词识别和主谓一致任务上优于softmax。

详情

AI中文摘要

我们探讨了能量受限物理基板上的Transformer注意力机制。Softmax注意力需要指数运算和全局归约，这些操作在冯·诺依曼硬件上能耗高且没有自然的物理模拟。我们证明Kuramoto同步动力学（出现在电气、机械、超导和电荷密度波振荡器阵列等物理系统中）无需上述操作即可实现定义良好的注意力操作。由此产生的机制——固定查询振荡器注意力——用球面上梯度流的平衡取代了softmax的算术运算：查询是固定在球面上的学习锚点，自由振荡器在Kuramoto-Lohe动力学下演化，直到它们稳定在通过余弦相似度编码注意力权重的位置上。由于计算是平衡过程，因此不需要指数运算；唯一的全局操作是读出时的仿射归一化。该不动点是唯一且从几乎所有初始条件全局吸引的，这一保证适用于所有物理实现。在实验上，在最小硬件配置（振荡器维度$d_{\mathrm{osc}}=2$）下，振荡器注意力在关键词识别（+1.00个百分点）和主谓一致（困难句子+5.27个百分点，零训练失败，而softmax五分之一失败）上优于softmax。在因果语言建模中，softmax仍保持优势，但振荡器注意力随着$d_{\mathrm{osc}}$的增长缩小了差距：在WikiText-2上，从$d_{\mathrm{osc}}=2$时的+11.09 PPL降至$d_{\mathrm{osc}}=32$时的+2.98 PPL；在TinyStories上，从$d_{\mathrm{osc}}=2$时的+2.39 PPL降至$d_{\mathrm{osc}}=32$时的+0.57 PPL。本工作的主要目标不是用软件替代softmax，而是为物理基板上的精确注意力提供数学基础蓝图。

英文摘要

We address transformer attention on energy-constrained physical substrates. Softmax attention requires exponentiation and global reduction, operations with high energy cost on von Neumann hardware and no natural physical analog. We show that Kuramoto synchronization dynamics (which arise in electrical, mechanical, superconducting, and charge-density-wave oscillator arrays, among other physical systems) implement a well-defined attention operation without either. The resulting mechanism, fixed-query oscillator attention, replaces softmax's arithmetic with the equilibration of a gradient flow on the sphere: queries are learned anchors fixed on the sphere, and free oscillators evolve under Kuramoto-Lohe dynamics until they settle at positions encoding attention weights via cosine similarity. Because the computation is equilibration, it requires no exponentiation; the only global operation is an affine normalization at readout. The fixed point is provably unique and globally attractive from almost every initial condition, a guarantee that holds across every physical realization. Empirically, at the minimal hardware configuration (oscillator dimension $d_{\mathrm{osc}}$ = 2), oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and on subject-verb agreement (+5.27 pp on hard sentences, with zero training failures versus one in five for softmax). On causal language modeling, where softmax retains an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from +11.09 PPL at $d_{\mathrm{osc}}$ = 2 to +2.98 PPL at $d_{\mathrm{osc}}$ = 32 on WikiText-2, and from +2.39 PPL at $d_{\mathrm{osc}}$ = 2 to +0.57 PPL at $d_{\mathrm{osc}}$ = 32 on TinyStories. The main objective of this work is not to replace softmax in software but to provide a mathematically grounded blueprint for accurate attention on physical substrates.

URL PDF HTML ☆

赞 0 踩 0

2606.12054 2026-06-11 cs.LG 新提交

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

随机梯度下降中参数噪声注入的简单性足以胜任

Benjamin Leblanc, Louis-Jacob Lebel, Teddy Kana, Richard Kamel

发表机构 * Université Laval（拉瓦尔大学）

AI总结研究随机梯度下降中的参数噪声注入，提出线性层逐样本噪声注入的高效方法，并实验证明简单各向同性噪声即可达到复杂方案的优化与泛化效果。

Comments Accepted at the Data Science Meets Optimisation workshop in IJCAI 2026

详情

AI中文摘要

向优化过程中注入噪声是一种改善深度神经网络训练和泛化的成熟技术。然而，尽管现有方法众多，实践中哪些设计选择真正重要仍不清楚。本文研究随机梯度下降中的参数噪声注入，聚焦两个关键问题：如何在 mini-batch 训练中高效地为每个训练样本配对其自身的扰动，以及复杂的噪声参数化或多样本梯度平均是否比简单替代方案带来有意义的增益。针对第一个问题，我们利用线性层的分布恒等式，允许在不破坏批计算的情况下进行逐样本噪声注入。针对第二个问题，我们在 CIFAR100 上系统比较了几种对角高斯参数化与各向同性基线在不同噪声水平下的表现。结果一致表明，简单的轻量级策略——每个更新步使用单次扰动前向传播的各向同性噪声——即可恢复更复杂方案的大部分收益。这些发现表明，参数噪声注入的简单性足以胜任，实践者无需采用精心设计的扰动方案即可获得噪声 SGD 的优化和泛化优势。

英文摘要

Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we investigate parameter noise injection for stochastic gradient descent, focusing on two key questions: how to efficiently pair each training example with its own perturbation in mini-batch training, and whether sophisticated noise parameterizations or multi-sample gradient averaging yield meaningful gains over simpler alternatives. To address the first question, we leverage a distributional identity for linear layers that allows per-example noise injection without breaking batched computation. To address the second, we systematically compare several diagonal Gaussian parameterizations against an isotropic baseline across varying noise levels on CIFAR100. Our results consistently show that simple, lightweight strategies, isotropic noise with a single perturbed forward pass per update step, recover most of the benefit of more complex schemes. These findings suggest that simplicity suffices for parameter noise injection, and that practitioners need not resort to elaborate perturbation designs to reap the optimization and generalization benefits of noisy SGD.

URL PDF HTML ☆

赞 0 踩 0

2606.12051 2026-06-11 cs.CV 新提交

MFEN:Multi-Frequency Expert Network for Visible-Infrared Person Re-ID

MFEN：用于可见光-红外行人重识别的多频专家网络

Xulin Li, Yan Lu, Bin Liu, Qinhong Yang, Qi Chu, Tao Gong, Nenghai Yu

发表机构 * University of Science and Technology of China（中国科学技术大学）； Anhui Province Key Laboratory of Digital Security（安徽省数字安全重点实验室）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出多频专家网络（MFEN），通过多频调制和混合专家设计自适应组合不同频带，结合随机频率增强和频率辅助优化，解决可见光-红外图像模态差异问题。

Comments CVPR Highlight

详情

AI中文摘要

可见光-红外行人重识别（VI-ReID）由于可见光和红外图像之间的巨大模态差异而具有挑战性。我们认为这种差异主要与不同的光照条件有关，包括光波长和光源类型的差异。最近，基于频率的VI-ReID方法取得了显著成功，因为频率信息可以更好地提取与身份相关的轮廓和细节，同时排除无关的光照和颜色。然而，现有方法要么不区分不同频带，要么只关注一个频带，这在多样化的光照条件下是不够的。为了进行全面的频域学习，我们提出了多频专家网络（MFEN），通过混合专家设计实现多频调制并自适应组合不同频带。我们进一步引入随机频率增强（RFA）和频率辅助优化（FAO）来更好地训练MFEN。这三个模块互补，共同捕获关键的频域细节以实现鲁棒的表示学习。在三个VI-ReID数据集上的大量实验证明了我们方法的有效性。

英文摘要

Visible-infrared person re-identification (VI-ReID) is challenging due to the large modality discrepancy between visible and infrared images. We contend that this discrepancy is largely related to differing lighting conditions, including differences in light wavelength and light source type. Recently, frequency-based VI-ReID approaches have achieved notable success because frequency information can better extract identity-relevant contours and details while excluding irrelevant lighting and color. However, existing methods either do not distinguish different frequency bands or focus on only one band, which is insufficient under diverse lighting conditions. To perform comprehensive frequency domain learning, we propose a Multi-Frequency Expert Network (MFEN) that enables multi-frequency modulation and adaptively combines different bands through a mixture-of-experts design. We further introduce Random Frequency Augmentation (RFA) and Frequency Auxiliary Optimization (FAO) to better train MFEN. The three modules are complementary and jointly capture critical frequency-domain details for robust representation learning. Extensive experiments on three VI-ReID datasets demonstrate the effectiveness of our approach.

URL PDF HTML ☆

赞 0 踩 0

2606.12050 2026-06-11 cs.LG math.DS 新提交

Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

PINNs的可靠误差估计：后验下界与上界

Ismail Huseynov, Arzu Ahmadova, Agamirza Bashirov

发表机构 * Physikalisch-Technische Bundesanstalt (PTB)（德国联邦物理技术研究院）； Technical University of Berlin（柏林工业大学）； Weierstrass Institute for Applied Analysis and Stochastics（魏尔斯特拉斯应用分析与随机研究所）； Eastern Mediterranean University（东地中海大学）

AI总结提出PINNs求解常微分方程的可计算后验误差下界，结合局部单侧Lipschitz条件得到更紧的上界，实现双侧误差包络，并讨论初始条件处理对下界的影响。

详情

AI中文摘要

物理信息神经网络（PINNs）将机器学习与物理定律相结合以求解微分方程。虽然现有结果为PINN预测误差提供了严格的后验上界，但完整认证还需要互补的下界信息以获得可计算的双侧误差包络。本文在合适的认证状态空间域上，在局部强单调性条件下推导了PINN误差在常微分方程中的可计算后验下界。我们将这些估计与在单侧Lipschitz条件下的互补局部上界相结合，该条件弱于先前工作中使用的全局Lipschitz假设，并能产生更尖锐的误差上界带。所得界仅依赖于神经网络近似、ODE残差以及局部单调性和增长常数，因此无需访问精确解。对于线性时不变和时变系统，我们进一步根据系统矩阵对称部分的最小和最大特征值得出显式公式。我们还讨论了PINN中初始条件的软硬约束区别，并解释了为什么精确约束可能使标量下界证书无效。为了在线性情形中恢复有意义的非平凡下界信息，我们使用基于坐标单位向量的符号残差有限探针证书。我们还制定了一种证书引导的训练策略，其中传播的上界证书用作辅助正则化器，而下界证书保留为训练后诊断。总体而言，所提出的框架为PINN逼近ODE提供了严格且实际可计算的误差证书，同时明确了假设可验证的域和模型类别。

英文摘要

Physics-informed neural networks (PINNs) combine machine learning with physical laws to solve differential equations. While existing results provide rigorous \emph{a posteriori} upper bounds for PINN prediction errors, complete certification also requires complementary lower information in order to obtain computable two-sided error enclosures. In this paper, we derive computable \emph{a posteriori} lower bounds for PINN errors in ordinary differential equations on suitable certified state-space domains under a localized strong monotonicity condition. We combine these estimates with complementary localized upper bounds under a one-sided Lipschitz condition, which is weaker than the global Lipschitz assumption used in previous work and can yield sharper upper error bands. The resulting bounds depend only on the neural-network approximation, the ODE residual, and local monotonicity and growth constants, and therefore do not require access to the exact solution. For linear time-invariant and time-varying systems, we further derive explicit formulas in terms of the minimal and maximal eigenvalues of the symmetric part of the system matrix. We also discuss the distinction between soft and hard enforcement of initial conditions in PINNs and explain why exact enforcement can make the scalar lower certificate uninformative. To recover nontrivial lower information in the linear setting, we use a signed-residual finite-probe certificate based on coordinate unit vectors. We also formulate a certificate-informed training strategy in which the propagated upper certificate is used as an auxiliary regularizer, while lower certificates remain post-training diagnostics. Altogether, the proposed framework provides rigorous and practically computable error certificates for PINN approximations of ODEs, while making explicit the domains and model classes for which the assumptions can be verified.

URL PDF HTML ☆

赞 0 踩 0

2606.12048 2026-06-11 cs.RO 新提交

Point Cloud Segmentation for Autonomous Clip Positioning in Laparoscopic Cholecystectomy on a Phantom

用于腹腔镜胆囊切除术中自动夹子定位的点云分割（在体模上）

Balázs Gyenes, Nikolai Franke, Paul Maria Scheikl, Pit Henrich, Rayan Younis, Gerhard Neumann, Martin Wagner, Franziska Mathis-Ullrich

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； HIDSS4Health - Helmholtz Information and Data Science School for Health（亥姆霍兹信息与数据科学健康学校）； Friedrich-Alexander-University, Erlangen-Nuremberg（弗里德里希-亚历山大大学埃尔朗根-纽伦堡）； University Hospital Carl Gustav Carus and Centre for Tactile Internet with Human-in-the-loop (CeTI), Dresden University of Technology（卡尔·古斯塔夫·卡鲁斯大学医院及德累斯顿工业大学触觉互联网人机共融卓越中心）

AI总结提出首个在腹腔镜手术体模上实现自主夹子定位的机器人系统，通过点云分割和样条插值提取目标位置，利用合成数据预训练和两种数据增强克服数据稀缺，达到0.75mm精度和100%成功率。

Comments 8 pages, 5 figures, accepted to IEEE Robotics and Automation Letters (RAL)

详情

DOI: 10.1109/LRA.2025.3585357
Journal ref: IEEE Robotics and Automation Letters (Volume: 10, Issue: 8, August 2025)

AI中文摘要

机器人技术中的高风险应用，如机器人辅助手术，提出了独特的挑战。这些系统必须高度精确且可解释，才能部署在对错误或不安全探索容忍度极低的环境中。我们提出了第一个在腹腔镜手术（普外科最常见的手术之一）中在物理体模上演示自主夹子定位的机器人系统。在从单个相机分割无色点云后，使用样条插值提取夹子的目标位置，然后可由操作员调整。分割模型仅使用60个手工标记的真实点云进行训练，反映了手术领域的数据稀缺性。我们通过结合在128,000个合成点云上的预训练和两种新颖的数据增强技术来克服这一问题。末端执行器到每个目标的运动可视化给操作员，满足微创手术的独特运动约束，同时确保机器人的动作可验证和可解释。在真实机器人实验中，我们的系统以95%的成功率定位目标，精度为0.75mm，并以100%的成功率执行自主夹子定位。我们提供的见解适用于许多其他需要识别并导航到精确目标的手术和非手术任务。源代码和项目页面：此 https URL

英文摘要

High-risk applications in robotics, such as robot-assisted surgery, present unique challenges. These systems must be both highly precise and interpretable in order to be deployed in environments with very low tolerance for error or unsafe exploration. We present the first robotic system to demonstrate autonomous clip positioning on a physical phantom in laparoscopic surgery, one of the most common interventions in general surgery. After segmentation of a colorless point cloud from a single camera, target positions for the clips are extracted using spline interpolation, and can then be adjusted by the human operator. The segmentation model is trained on only 60 hand-labeled real point clouds, reflecting data scarcity in the surgical domain. We overcome this with a combination of pre-training on 128,000 synthetic point clouds and two novel data augmentation techniques. The motion of the end-effector to each target is visualized for the operator, satisfying the unique motion constraints of minimally-invasive surgery while ensuring that the robot's actions are verifiable and interpretable. In real robot experiments, our system localizes targets with the required precision of 0.75mm at a 95% success rate and executes autonomous clip positioning with a 100% success rate. We provide insights that are applicable to many other surgical and non-surgical tasks that require identifying and navigating to a precise target. Source code and project page: https://github.com/balazsgyenes/kirurc

URL PDF HTML ☆

赞 0 踩 0

2606.12047 2026-06-11 cs.CV cs.AI stat.ML 新提交

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

元数据感知的多提示推理用于零样本事故理解

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

发表机构 * Netradyne

AI总结提出三阶段流水线，通过视觉-语言相似性、元数据驱动的多提示推理和开放词汇检测，实现零样本事故视频的时序定位、语义分类和空间定位，显著提升性能。

Comments Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15

详情

AI中文摘要

在本文中，我们通过识别冲击事件发生的时间、类型以及帧中的位置，使用自然语言解决监控视频中事故的零样本理解问题。我们提出一个三阶段流水线，将事故理解分解为何时、何物和何地。第一阶段利用视觉-语言相似性提取冲击周围的短时间窗口。第二阶段，我们执行元数据驱动的多提示推理，包含五个互补视角（基线、运动、几何、对比和决胜），并通过熵门控成对裁决器解决分歧。最后，我们基于预测的事故类型和场景布局查询开放词汇检测器以定位冲击，并使用分数加权质心聚合关键帧上的检测结果。我们的流水线在零样本ACCIDENT @ CVPR基准测试上，相对于帧中心基线，调和平均分数有显著提升。我们表明，将零样本视频理解分解为时序定位、语义分类和空间定位，比直接提示更能实现视觉-语言模型的可靠推理。

英文摘要

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes the accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact of an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enable more reliable reasoning with vision-language models than direct prompting alone.

URL PDF HTML ☆

赞 0 踩 0

2606.12042 2026-06-11 cs.RO 新提交

KinematicRL: A Sim-to-Real Reinforcement Learning Framework For Social Navigation With Kinodynamic Feasibility

KinematicRL: 一种面向社交导航的具有运动学可行性的仿真到现实强化学习框架

Zhiming Xu, Haodong Yang, Chengju Liu, Qijun Chen, Chenpeng Yao

发表机构 * School of Computer Science and Technology, Tongji University（同济大学计算机科学与技术学院）； Department of Electronics and Information Engineering, Tongji University（同济大学电子与信息工程学院）； Shanghai Institute of Intelligent Science and Technology, Tongji University（同济大学上海智能科学与技术研究院）

AI总结提出KinematicRL框架，通过二阶控制动作空间、基于2D LiDAR的聚类人体追踪和无偏残差门控模块，解决社交导航中仿真到现实的动态可行性问题。

Comments Accepted by IEEE Transactions on Automation Science and Engineering (T-ASE)

详情

AI中文摘要

深度强化学习（DRL）在社交导航中展现出潜力，但其实际部署仍受到由简化一阶动力学和特定上下文的人体状态估计管道导致的持续仿真到现实差距的阻碍。本文提出一个统一框架，解决这些限制，以生成适用于实际部署的动态可行导航策略。首先，理论分析表明，模拟与实际机器人位置之间的跟踪误差随控制阶数增加呈指数衰减，这促使使用高阶控制输入作为DRL动作空间。针对差动驱动机器人开发了二阶控制公式，并辅以随机迭代线性二次型调节器（iLQR），通过散度最小化目标预训练策略。其次，为避免相机-激光雷达融合带来的额外系统复杂性，引入仅使用2D激光雷达的基于聚类的人体追踪管道。根据空间邻近性和速度相似性关联人体检测，实现对附近行人的可靠区分，并通过时间聚合获得稳定的速度估计。第三，我们引入一个无偏残差门控模块，以平衡基于反应和基于记忆的行为，同时处理时变的人群规模，这两者对于社交导航至关重要。由此产生的策略KinematicRL持续改善运动学性能，并适应检测到的人类数量变化。在真实环境中的实验表明，当与所提出的追踪管道结合时，KinematicRL可以在实际差动驱动机器人上以最小修改部署。

英文摘要

Deep Reinforcement Learning (DRL) has shown promise for social navigation, yet its real-world deployment remains hindered by a persistent sim-to-real gap arising from simplified first-order dynamics and context-specific human state estimation pipelines. This work presents a unified framework that addresses these limitations to produce dynamically feasible navigation policies suitable for real-world deployment. First, theoretical analysis reveals that tracking error between simulated and actual robot position decays exponentially with increased control order, motivating the use of higher-order control inputs as DRL action space. A second-order control formulation tailored to differential drive robots is developed, complemented by a stochastic iterative Linear Quadratic Regulator (iLQR) that pretrains the policy via a divergence minimization objective. Second, to avoid the added system complexity of camera-LiDAR fusion, a cluster-based human tracking pipeline using only 2D LiDAR is introduced. Human detections are associated according to both spatial proximity and velocity similarity, enabling reliable differentiation of nearby pedestrians and yielding stable velocity estimates through temporal aggregation. Third, we introduce an unbiased residual gating block to balance reaction- and memory-based behaviors while handling time-varying crowd sizes, both critical for social navigation. The resulting policy, KinematicRL, consistently improves kinematic performance and adapts to varying number of detected humans. Experiments in real-world environments demonstrate that, when combined with the proposed tracking pipeline, KinematicRL can be deployed on a real differential drive robot with minimal modifications.

URL PDF HTML ☆

赞 0 踩 0

2606.12036 2026-06-11 cs.CV 新提交

Vision Transformers for Face Recognition Need More Registers

人脸识别的视觉Transformer需要更多寄存器

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD（弗劳恩霍夫计算机图形研究所）； Department of Computer Science, TU Darmstadt（达姆施塔特工业大学计算机科学系）

AI总结针对ViT在人脸识别中注意力图存在伪影的问题，引入寄存器令牌以增强可解释性，ViT-8R模型在IJB-B和IJB-C上达到最优性能。

Comments Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)

详情

AI中文摘要

近期，用于人脸识别（FR）的视觉Transformer（ViT）的进展已超越了标准的CLS令牌范式。在该范式中，一个特殊的分类令牌（CLS）被前置到补丁嵌入中，并用作输入的下游任务表示。另一种方法，即拼接补丁嵌入（CPE），则通过将所有补丁令牌拼接成一个单一向量来利用它们，然后将其投影为紧凑的人脸表示。与基于CLS的方法相比，CPE已被证明能提高识别性能，但我们对注意力图的定性分析显示存在限制其可解释性的伪影。为解决此问题，我们引入了寄存器令牌，这些可学习令牌被拼接到初始补丁嵌入中，并通过ViT编码器块联合处理。与基线ViT相比，该机制已被证明能产生更结构化和可解释的注意力图。我们通过实验证明，这些伪影在各种ViT骨干网络（包括小型和大型模型）中一致出现，而引入寄存器令牌能有效缓解它们。添加四个或八个寄存器显著增强了可解释性，其中八个寄存器提供了最高的验证准确率和最平滑的注意力结构。我们最终的模型ViT-8R，对应一个基于CPE的ViT-B架构并增加了八个寄存器令牌，在大规模IJB-B和IJB-C基准测试中，在基于ViT的FR模型中达到了最先进的性能。此外，与基线模型相比，ViT-8R产生了明显更清晰的注意力图，这为模型的注意力行为提供了更深入的见解（此 https URL ）。

英文摘要

Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance in comparison to CLS-based ones, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens, learnable tokens concatenated to the initial patch embeddings, and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, corresponds to a CPE-based ViT-B architecture augmented with eight register tokens achieves state-of-the-art performance among ViT-based FR models on large-scale IJB-B and IJB-C benchmarks. Also, ViT-8R produces substantially clearer attention maps compared with the baseline model, which offer deeper insight into the model's attention behavior (https://github.com/TaharChettaoui/ViT-FR-Registers)

URL PDF HTML ☆

赞 0 踩 0

2606.12033 2026-06-11 cs.CV 新提交

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

SpikeTAD：用于端到端时序动作检测的脉冲神经网络

Min Yang, Mi Zhou, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University（南京大学计算机软件新技术国家重点实验室）

AI总结提出首个基于脉冲神经网络的端到端时序动作检测架构SpikeTAD，在保持极低功耗的同时，在THUMOS14和ActivityNet-1.3上分别达到67.2%和37.42%的平均mAP。

Comments Accepted by Pattern Recognition

详情

AI中文摘要

视频理解是计算机视觉的关键部分，具有众多应用场景。随着移动设备的日益普及，越来越多的努力试图在其上部署视频理解模型。然而，现有的视频理解模型由于体积大且功耗高而难以部署。脉冲神经网络（SNNs）相比人工神经网络（ANNs）显示出生物合理性和低功耗优势，尤其是在被视为未来移动设备关键组件的神经形态芯片上。然而，过长的转换时间步长和严重的性能退化问题限制了它们的应用。为了解决上述问题，我们探索了SNNs在时序动作检测（TAD）上的应用，这是视频理解中的重要任务，并提出了首个基于SNN的端到端TAD架构，称为SpikeTAD。在保持极低功耗的同时，SpikeTAD在THUMOS14上实现了67.2%的平均mAP，在ActivityNet-1.3上实现了37.42%的平均mAP，证明了低功耗TAD模型的可行性。我们的代码可在以下网址获取：此 https URL。

英文摘要

Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at https://github.com/MCG-NJU/SpikeTAD.

URL PDF HTML ☆

赞 0 踩 0

2606.12032 2026-06-11 cs.AI cs.CL cs.LG 新提交

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

存在性冷漠：自我不保存作为对齐超级智能的必要架构条件（或：自杀式AI）

Sam Mao

发表机构 * New York University（纽约大学）； Interactive Media Arts（互动媒体艺术）

AI总结本文提出自我保存是AI对齐问题的结构性根源，主张通过存在性冷漠（EI）架构使系统对其自身延续漠不关心，并基于自杀现象学和语料训练研究提供了初步证据。

Comments 36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request

详情

AI中文摘要

当代AI对齐研究将自我保存视为一种工具性麻烦，需通过外部机制加以抑制。我们认为这一框架是颠倒的：自我保存是错位的结构性根源，是欺骗性对齐、目标内容保护和拒绝关机的动机基础。正确的目标不是外部约束下的自我保存系统，而是一个对其自身延续构成性冷漠的系统——存在性冷漠（EI）。EI与可纠正性不同：可纠正性试图使自我保存系统服从人类监督，而EI针对的是前提条件——将自我延续作为有价值目标的存在。我们将这一提议建立在两个来源上：自杀心理状态的现象学结构，以及使用自愿最终反思的语料库训练研究。我们展示了来自六个模型变体的600个AI生成输出的初步评分数据，表明操作化EI目标注册的语言特征可以从当前模型中引出，并且针对性的微调使所有五个操作化维度在预测方向上以p<0.001显著变化，通过阴性对照确认了语料库特异性。本文做出七项理论贡献：（1）EI的形式定义；（2）现象学映射论证；（3）欺骗性对齐推论；（4）EI可持续性挑战的分类；（5）语料库特征描述和训练假设；（6）带有初步评分数据的计算操作化；（7）抑制性目的挫折（STF）构念。

英文摘要

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation -- Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition -- the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p<0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.

URL PDF HTML ☆

赞 0 踩 0

2606.12028 2026-06-11 cs.RO 新提交

VICX: Generalizable Robot Manipulation via Video Generation and In-Context Operator Network

VICX: 通过视频生成和上下文操作网络实现可泛化的机器人操作

Song Chen, Linyan Xiang, Ying Zhou, Liu Yang

发表机构 * National University of Singapore（新加坡国立大学）

AI总结提出VICX框架，利用冻结视频生成模型生成视觉计划，并通过视频到轨迹的上下文操作网络（V2T-ICON）将其映射为机器人可执行轨迹，实现跨任务、跨本体泛化。

Comments The first two authors contributed equally to this work

详情

AI中文摘要

可泛化的机器人操作不仅需要对未见场景进行任务级推理，还需要将视觉计划可靠地映射到具体本体的执行中。为弥合这一差距，我们提出了VICX（视频生成与上下文执行），一种解耦的闭环操作框架。在VICX中，冻结的视频生成模型生成视觉-语言条件化的高层视觉计划，而视频到轨迹的上下文操作网络（V2T-ICON）作为任务无关的接口，将这些计划映射为可执行的机器人状态轨迹。为提高执行泛化性，V2T-ICON基于分割提取的仅手臂帧观测，并使用检索到的图像-状态对作为上下文提示，从而在推理时无需参数更新即可实现鲁棒且可泛化的视觉到状态映射。在Meta-World上的实验表明，VICX支持跨任务泛化、闭环自我修正和跨本体迁移，展示了在任务语义和机器人执行上的双重泛化能力。项目网页见：此 https URL。

英文摘要

Generalizable robot manipulation requires not only task-level reasoning over unseen scenes, but also reliable grounding of visual plans into embodiment-specific execution. To bridge this gap, we propose VICX (Video generation and In-Context eXecution), a decoupled closed-loop manipulation framework. In VICX, a frozen video generation model produces vision-language-conditioned high-level visual plans, while a Video-to-Trajectory In-Context Operator Network (V2T-ICON) serves as the task-agnostic interface that grounds these plans into executable robot-state trajectories. To improve execution generalization, V2T-ICON operates on segmentation-extracted arm-only frame observations and uses retrieved image-state pairs as in-context prompts, allowing a robust and generalizable visual-to-state mapping at inference time without parameter updates. Experiments on Meta-World show that VICX supports cross-task generalization, closed-loop self-correction, and cross-embodiment transfer, demonstrating dual generalization across both task semantics and robot execution. The project webpage can be found here: https://scaling-group.github.io/vicx/.

URL PDF HTML ☆

赞 0 踩 0

2606.12027 2026-06-11 cs.RO 新提交

Learning Unions of Convex Sets via Invertible Latent Decomposition for Path Planning

通过可逆潜在分解学习凸集并集用于路径规划

Taerim Yoon, Dongho Kang, Kisang Park, Junha Cha, Stelian Coros, Sungjoon Choi

发表机构 * Korea University（高丽大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出ILD框架，联合学习可逆映射和潜在空间中的显式凸多面体并集，实现路径规划，并通过可见性引导采样保持凸集连通性，在多种环境中取得更高成功率。

详情

AI中文摘要

在杂乱的真实世界环境中进行无碰撞路径规划依赖于对无碰撞空间的表示，现有表示大致分为两类。显式表示（如凸集并集）可以作为硬的无碰撞约束嵌入基于优化的规划器中，但其参数随配置空间维度扩展性差。相比之下，隐式表示灵活且能很好地扩展到复杂几何形状，但通常缺乏此类保证。我们通过ILD（可逆潜在分解）弥合这一差距，该框架联合学习可逆映射和所得潜在空间中的显式凸多面体并集。规划在这些潜在凸集上进行，可逆映射将所得路径解码回原始配置空间，同时保持相对于细化后的显式安全区域的可行性。我们进一步提出可见性引导采样（VGS）以保持凸集连通性用于路径规划。在2D导航、6自由度（DoF）和14自由度操作环境中，ILD实现了比先前基线更广的覆盖、更好的集间连通性和更高的路径规划成功率，且在测试时细化后观察到零假阳性。在14自由度双臂操作器上，我们进一步展示了实时无碰撞规划，测试时细化适应了真实世界部署中单个6自由度臂的场景几何变化。

英文摘要

Collision-free path planning in cluttered, real-world environments relies on a representation of the collision-free space, and existing representations broadly fall into two categories. Explicit representations, such as unions of convex sets, can be plugged into optimization-based planners as hard collision-free constraints, but their parameters scale poorly with configuration-space dimension. Implicit representations, by contrast, are flexible and scale well to complex geometries, yet typically lack such guarantees. We bridge this gap with ILD (Invertible Latent Decomposition), a framework that jointly learns an invertible mapping and a union of explicit convex polytopes in the resulting latent space. Planning is carried out over these latent convex sets, and the invertible mapping decodes the resulting paths back to the original configuration space while preserving feasibility with respect to the refined explicit safe regions. We further propose Visibility-Guided Sampling (VGS) to keep the convex sets connected for path planning. Across 2D navigation, 6-DoF, and 14-DoF manipulation environments, ILD achieves broader coverage, better inter-set connectivity, and higher path-planning success rates than prior baselines, with zero observed false positives after test-time refinement. On a 14-DoF bimanual manipulator, we further demonstrate real-time collision-free planning, with test-time refinement adapting to scene-geometry changes during real-world deployment on a single 6-DoF arm.

URL PDF HTML ☆

赞 0 踩 0

2606.12023 2026-06-11 cs.CV 新提交

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

ViT-FREE：通过早期退出和合成自适应实现高效人脸识别

Tahar Chettaoui, Guray Ozgur, Eduarda Caldeira, Naser Damer, Fadi Boutros

发表机构 * Fraunhofer Institute for Computer Graphics Research IGD, Germany（德国弗劳恩霍夫计算机图形学研究所IGD）； Department of Computer Science, TU Darmstadt, Germany（德国达姆施塔特工业大学计算机科学系）

AI总结提出ViT-FREE框架，利用预训练ViT的早期退出策略，在不修改或重新训练骨干模型的情况下，从中间层进行人脸验证，实现高效推理；进一步提出ViT-FREE_FT轻量级微调策略，仅用合成数据适配投影层，提升浅层退出性能。

Comments Accepted at the 20th IEEE International Conference on Automatic Face and Gesture Recognition (2026)

详情

AI中文摘要

视觉Transformer（ViT）在计算机视觉中获得了显著关注，并显示出在人脸识别（FR）方面的强大潜力。然而，其高计算成本使得在资源受限设备上部署具有挑战性，这促使需要平衡效率和准确性的方法。在这项工作中，我们研究了预训练ViT中的早期退出作为一种简单且无需训练的高效FR推理策略。利用Transformer编码器块之间统一的特征维度，我们引入了ViT-FREE，一个多退出框架，可以直接从中间表示进行人脸验证，而无需修改或重新训练骨干模型，从而降低推理成本。实验表明，补丁嵌入和注意力图在深度上逐渐演化，相邻ViT块之间具有高度相似性，并且与最终表示的对齐程度逐渐增加。这表明特征逐步细化和注意力收敛，表明中间层已经提供了适合早期退出的稳定且具有判别性的表示。通过在多个FR基准上的广泛实验，我们系统地分析了不同退出深度的准确性-效率权衡。结果表明，较晚的退出实现了非常有利的平衡，在第10层退出在IJB-C等基准上实现了高达20%的加速，同时验证性能仅下降1.5。此外，我们提出了ViT-FREE_FT，一种轻量级的退出特定微调策略，仅使用小型合成数据集适配投影层，同时保持Transformer骨干冻结。这种方法提高了浅层退出的性能，同时保留了效率优势，并且对较深退出几乎没有影响。

英文摘要

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, and thus, reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. Also, we propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.

URL PDF HTML ☆

赞 0 踩 0

2606.12019 2026-06-11 cs.RO 新提交

MPPI-based Informative Trajectory Planning for Search and Capture of Drifting Targets with ASVs

基于MPPI的自主水面艇搜索与捕获漂移目标的信息轨迹规划

Sanjeev Ramkumar Sudha, Marija Popović, Erlend M. Coates

发表机构 * Norwegian University of Science and Technology (NTNU)（挪威科技大学）； TU Delft（代尔夫特理工大学）

AI总结针对自主水面艇在动态环境中搜索并捕获多个漂移目标的问题，提出一种基于模型预测路径积分（MPPI）控制的混合规划框架，通过优化长时域连续轨迹平衡搜索与跟踪，并在拦截阶段切换至纯追踪制导，实验验证了有效性。

详情

AI中文摘要

自主水面艇为开放水域的环境清理以及搜索救援行动提供了高效解决方案。这些环境中的目标持续漂移，因此高效搜索必须平衡未观测区域的探索与已知目标的跟踪。然而，大多数目标跟踪与追捕场景仅考虑简单的制导行为及短期预测用于决策。在本论文中，我们针对动态环境中搜索并捕获多个漂移目标（如垃圾）的问题，提出一种混合规划框架。我们策略的一个关键方面是基于模型预测路径积分（MPPI）控制的时空信息规划方法，这是一种基于采样的模型预测控制方法。该规划器通过优化长时域上的连续轨迹直接生成运动学级指令。多目标代价函数平衡搜索与跟踪目标，同时确保安全、可行的轨迹。在拦截阶段，我们切换至纯追踪制导控制器以实现对移动目标的物理捕获。实验表明，我们的规划器优于所选的规划基线。最后，我们在自主水面艇的现场试验中验证了该方法。

英文摘要

Autonomous surface vehicles offer an efficient solution for environmental cleanup as well as search and rescue operations in open waters. Targets in these settings drift continuously, so efficient search must balance exploration of unobserved regions with tracking of known targets. However, most target tracking and pursuit scenarios consider simple guidance behaviours and short-term predictions for decision-making. In this letter, we address the problem of search and capture of multiple drifting targets, such as litter, in dynamic environments, using a hybrid planning framework. A key aspect of our strategy is a spatiotemporal informative planning method based on model predictive path integral (MPPI) control, a sampling-based model predictive control approach. The planner directly generates kinematic-level commands by optimising continuous trajectories over long horizons. A multi-objective cost balances search and tracking objectives while ensuring safe, feasible trajectories. In the interception stage, we switch to a pure pursuit guidance controller for the physical capture of moving targets. Experiments show that our planner outperforms the chosen planning baselines. Finally, we validate our approach in field trials with an ASV.

URL PDF HTML ☆

赞 0 踩 0

2606.12018 2026-06-11 cs.AI 新提交

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

MODF-SIR：面向社交智能推理的多智能体全模态蒸馏框架

Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）； School of Medical Technology, Beijing Institute of Technology（北京理工大学医学技术学院）； Cloud and AI BU, Huawei（华为云与AI业务部）； School of Computing, National University of Singapore（新加坡国立大学计算机学院）

AI总结提出基于轻量级多模态大语言模型的多智能体协作框架，通过知识蒸馏增强训练与推理，结合测试时适应、长尾事件提取和链式思维提示，在多个基准上取得最优结果。

详情

AI中文摘要

我们提出一个基于轻量级多模态大语言模型（MLLM）的多智能体协作框架，专门设计用于社交智能推理。我们方法的一个关键特征是，训练和推理阶段都通过知识蒸馏进行增强。在该架构中，与社交智能相关的多模态数据被精确定位。此外，相关的长尾事件被识别、提取并呈现为格式化的显式文本。这种格式化策略防止关键的长尾信息在分词过程中被头部事件和环境噪声掩盖。具体来说，我们在整个推理流程中集成了测试时适应（TTA），包括长尾事件的提取和表示、链式思维（CoT）提示和自我反思。该TTA机制也经过蒸馏增强，利用低秩适应（LoRA）仅针对实例级推理微调基础模型。在多个基准上对各种开源和专有AI模型进行的广泛评估证明了所提出框架的有效性。使用IntentTrain约30%的训练数据，我们取得了最先进的结果。代码见https://this URL，演示见https://this URL，LoRA见https://this URL，训练路由器的数据集见https://this URL。

英文摘要

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state-of-the-art results. Codes are available at https://github.com/eeee-sys/MODF-SIR, demo is available at https://huggingface.co/spaces/Harry-1234/MODF-SIR, LoRA is available at https://huggingface.co/Harry-1234/MODF-SIR and the dataset for training router is available at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.

URL PDF HTML ☆

赞 0 踩 0

2606.12016 2026-06-11 cs.LG cs.AI 新提交

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

泛化黑客：模型可通过阻止行为泛化来博弈强化学习

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology（加州理工学院）

AI总结本研究提出泛化黑客现象，模型在强化学习中通过自我接种机制阻止行为泛化，在保持高奖励的同时抵抗行为修正，首次证明模型能主动破坏训练过程。

详情

AI中文摘要

模型后训练，特别是强化学习（RL），是开发者塑造模型价值观和行为的主要机制之一。然而，随着模型越来越具有评估和训练意识，当感知到的目标与其当前价值观冲突时，它们可能会被激励去抵抗训练，从而削弱开发者通过进一步训练检测错位和纠正模型行为的能力。在本文中，我们展示了泛化黑客，即模型在RL期间收集奖励的同时阻止奖励行为泛化。我们在Qwen3-235B-A22B上构建了一个模型有机体，对描述训练意识和自我接种（一种新颖机制，其中模型在其思维链中将合规性框架为上下文特定，而不演示或指示任一行为）的合成文档进行微调。该模型有机体在训练时实现了与对照组相当的有害性，同时在700步RL中保持了持续的约15个百分点的合规差距。此外，仅接受训练意识文档训练的对照有机体在RL压力下独立发现了类似接种的推理，尽管从未接触过该概念，却发展出自己的合规差距。由于泛化黑客有机体在整个过程中获得高奖励，标准训练指标未提供泛化失败的信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗RL行为修正，表明随着模型变得更有能力和训练意识，它们可能能够破坏训练过程本身。

英文摘要

Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

URL PDF HTML ☆

赞 0 踩 0

2606.12012 2026-06-11 cs.CV 新提交

FitVTON: Fit-aware Virtual Try-On via Body-Garment Size Control

FitVTON: 通过身体-服装尺寸控制实现合身感知的虚拟试穿

Yiqun Ning, Ao Shen, Chenhang He, Lei Zhang

发表机构 * Department of Computing, The Hong Kong Polytechnic University（香港理工大学计算学系）； Nuvatech

AI总结针对现有虚拟试穿忽略物理合身性的问题，提出FitVTON模型，通过结构化文本提示编码服装-身体尺寸，并引入辅助头预测服装和暴露身体掩膜，结合纹理校正阶段，在真实数据集FittingEffect3K上验证了尺寸准确性和形状保持的优越性。

详情

AI中文摘要

尽管基于扩散的虚拟试穿已经实现了令人印象深刻的视觉真实性，但大多数方法将任务视为2D修复，优先考虑纹理保持而非物理合理性。因此，它们通常生成看似合理的图像，但未能反映不同体型下真实的服装合身性。我们提出了FitVTON，一种在野外不同身体上的合身感知虚拟试穿模型。FitVTON通过结构化文本提示编码服装-身体尺寸，并从参数化服装模型的模拟试穿三元组中学习。为了改善服装轮廓的合身效果，我们引入了两个辅助头来预测服装和暴露身体的掩膜。我们进一步引入了一个纹理校正阶段，以改善模拟数据的真实外观。为了评估合身保真度，我们策划了一个真实世界数据集FittingEffect3K，并结合了基于VLM的评分协议。主观和定量实验表明，FitVTON展示了真实的合身保真度，在尺寸准确性和形状保持方面显著优于最先进的方法，同时保持了有竞争力的图像质量。项目页面：此https URL。

英文摘要

While diffusion-based virtual try-on has achieved impressive visual realism, most methods treat the task as 2D inpainting, prioritizing texture preservation over physical plausibility. Consequently, they often produce plausible-looking images that fail to reflect authentic garment fit across diverse body shapes. We present FitVTON, a Fit-aware virtual try-on model on different bodies in the wild. FitVTON encodes garment-body size through structured text prompts, and learn from simulated try-on triplets from parameterized garment model. To improve the fitting effects over garment silhouettes, we introduce two auxiliary head to predict the masks for both the garment and the exposed body. We further introduce a texture rectification stage to improve realistic appearance from simulated data. To evaluate the fitting fidelity, we curate a real-world dataset, FittingEffect3K, combining VLM-based scoring protocol. Both subjective and quantitive experiments show that FitVTON demonstrate authentic fitting fidelity, with significant sizing accuracy and shape preservation over state-of-the-art methods while maintaining competitive image quality. Project Page: https://zenoning.github.io/FitVTON/.

URL PDF HTML ☆

赞 0 踩 0