arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.20602 2026-05-21 cs.CL cs.AI cs.LG

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

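Ming Liu

Affiliations Amazon

AI summary The study finds experimentally that self-training does not flatten language but restructures it: surface markers are amplified while deep syntactic structure disappears. It proposes the Structural Depth Hypothesis to explain the effect.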

Comments 19 pages (14 main + 5 appendix), 8 figures, 3 tables

Abstract

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.
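
The pooled statistic above is a Spearman rank correlation with a cluster-bootstrap confidence interval. A minimal sketch of that computation follows, assuming the clusters are the individual models and that each panel is a list of (structural depth, decay rate) pairs; the function names, the data layout, and the percentile form of the interval are illustrative choices, not taken from the paper.

```python
import random

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def cluster_bootstrap_ci(panels, n_boot=2000, alpha=0.05, seed=0):
    """panels: dict model_name -> list of (depth, decay_rate).
    Whole models are resampled with replacement (assumed clustering unit)."""
    rng = random.Random(seed)
    names = list(panels)
    stats = []
    for _ in range(n_boot):
        rows = [pair for _draw in names for pair in panels[rng.choice(names)]]
        xs, ys = zip(*rows)
        stats.append(spearman(xs, ys))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling whole models rather than individual features keeps the features of one model together, which is the point of a cluster bootstrap when observations within a model are not independent.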

2605.20600 2026-05-21 cs.CV

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

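Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye

Affiliations Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory

AI summary The paper proposes HeadKV, a framework that assigns different cache budgets to attention heads according to their locality bias, improving the efficiency and memory use of autoregressive image generation. It also designs a Stratified Token Eviction strategy to preserve long-range information.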

Comments Under review

Abstract

Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.
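
The idea of typing heads from early tokens and reusing that decision can be sketched as follows. This is only an illustration under stated assumptions: the locality score (share of attention mass within a fixed window of the query), the threshold, and the two budget sizes are hypothetical and are not HeadKV's actual criteria.

```python
def locality_score(attn_rows, window):
    """attn_rows: list of (query_pos, weights over key positions 0..query_pos).
    Returns the mean share of attention mass within `window` tokens of the query."""
    shares = []
    for q, weights in attn_rows:
        local = sum(w for k, w in enumerate(weights) if q - k < window)
        shares.append(local / sum(weights))
    return sum(shares) / len(shares)

def assign_budgets(early_attn_by_head, window=4, threshold=0.8, small=16, large=64):
    """Decide each head's cache budget once, from early-token attention,
    so the same budget can be reused for the rest of generation."""
    return {
        head: small if locality_score(rows, window) >= threshold else large
        for head, rows in early_attn_by_head.items()
    }

# Hypothetical attention rows for two heads at query position 5.
early = {
    "local_head": [(5, [0.0, 0.0, 0.0, 0.0, 0.1, 0.9])],
    "global_head": [(5, [0.5, 0.3, 0.0, 0.0, 0.1, 0.1])],
}
budgets = assign_budgets(early)
```

Because the decision uses only attention observed during generation itself, a scheme of this shape needs no extra training or dataset-level statistics, which matches the advantage the abstract claims.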

2605.20599 2026-05-21 cs.LG

Unsupervised clustering and classification of upper limb EMG signals during functional movements: a data-driven approach

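L. F. Salazar Álvarez, D. Escobar-Saltarén, M. B. Salazar Sánchez, S. C. Henao-Aguirre

Affiliations In2Lab, Engineer Faculty, Universidad de Antioquia; School of Engineering and Sciences, Tecnológico de Monterrey

AI summary The paper presents a comprehensive, data-driven approach to clustering and classifying upper-limb surface EMG signals during functional reach and grasp movements. Applied to the NINAPRO DB4 dataset, its four-stage pipeline covers signal preprocessing, feature extraction, gesture selection via hierarchical clustering, and comparative model evaluation, and ends with five key features selected for classification.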

Comments 19 Congreso Colombiano de Computación (19CCC)

Abstract

This study presents a comprehensive approach for the clustering and classification of upper-limb surface electromyography (sEMG) signals during functional reach and grasp movements. The methodology was applied to the NINAPRO DB4 dataset, which provides multichannel EMG recordings of 52 gestures. A four-stage pipeline was designed, including signal preprocessing, feature extraction, gesture selection via hierarchical clustering, and comparative model evaluation. Preprocessing involved a fourth-order low-pass filter (0.6 Hz) and Hilbert envelope transformation, effectively reducing noise and enhancing signal clarity. Feature extraction yielded 26 temporal and frequency-domain metrics, which were later refined using visual analysis, mutual information, principal component analysis, and decision tree importance scores. A final subset of five key features was selected for classification tasks. Gesture selection was performed through hierarchical clustering using Mahalanobis distance, resulting in six representative movements that balanced biomechanical diversity and computational efficiency. A 200 ms window was identified as optimal for temporal segmentation based on stability and physiological plausibility. Classifier models were evaluated in two stages. Automated comparison using PyCaret identified Extra Trees (ET) and Artificial Neural Networks (ANN) as top performers. Subsequent independent training confirmed their stability and generalization capacity, with ANN showing progressive learning and ET maintaining robust, consistent results. The findings support the implementation of adaptive, low-latency control strategies for myoelectric prostheses and provide a scalable pipeline for future real-time applications.
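
The 200 ms segmentation step can be illustrated with a short sketch. The three features computed here (mean absolute value, root mean square, zero crossings) are common time-domain sEMG features chosen for illustration; the abstract does not name the five features finally selected. The sampling rate is a caller-supplied assumption.

```python
import math

def sliding_windows(signal, fs_hz, window_ms=200, step_ms=200):
    """Split one channel into fixed-length windows (non-overlapping by default)."""
    size = int(fs_hz * window_ms / 1000)
    step = int(fs_hz * step_ms / 1000)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def time_domain_features(window):
    """Illustrative per-window features; not the paper's selected subset."""
    n = len(window)
    mav = sum(abs(v) for v in window) / n
    rms = math.sqrt(sum(v * v for v in window) / n)
    zero_crossings = sum(1 for a, b in zip(window, window[1:]) if a * b < 0)
    return {"mav": mav, "rms": rms, "zc": zero_crossings}

# Toy alternating signal sampled at an assumed 1 kHz.
toy_signal = [1.0, -1.0] * 300
windows = sliding_windows(toy_signal, fs_hz=1000)
features = [time_domain_features(w) for w in windows]
```

Each window then yields one feature vector, which is the unit a classifier such as Extra Trees or an ANN would consume.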

2605.20595 2026-05-21 cs.RO cs.MA cs.NI

Intent-First Aerial V2V for Tactical Coordination and Separation: Protocol and Performance Under Density and Disturbance

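Mehrnaz Sabet

Affiliations Cornell University

AI summary The study proposes an intent-first aerial vehicle-to-vehicle (V2V) protocol for dense Unmanned Aircraft System Traffic Management (UTM) operations, a deployable neighborhood communication mechanism that supplies fresh, trusted information for local coordination. It combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination.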

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems

Abstract

Dense low-altitude aerial operations require more than pre-flight route coordination and last-resort collision avoidance. Once aircraft are airborne, disturbances can emerge on timescales shorter than strategic reauthorization can absorb, while collision avoidance is too late and disruptive to serve as routine traffic management. Although tactical separation is recognized as the intermediate layer, realizing it at scale requires a deployable neighborhood communication mechanism that provides fresh, trusted information for local coordination. This paper presents what is, to our knowledge, the first controller-coupled characterization of an all-airborne, sidelink-class, intent-first vehicle-to-vehicle (V2V) tactical neighborhood exchange stack for dense Unmanned Aircraft System Traffic Management (UTM) operations. Unlike awareness-only broadcast, the proposed exchange combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination. We implement and evaluate this model on an all-airborne V2V stack using sidelink-class C-V2X modules with authenticated freshness checks. Evaluation uses a scenario-driven, high-volume stress campaign supported by real-time, field-anchored infrastructure. Results show that V2V reduces stale-belief divergence, preserves observability through cooperative perception, rejects invalid tactical messages, suppresses false local inference, and structures shared-resource coordination. The implemented stack provides a viable communication layer for tactical separation in lower-to-moderate regimes, but transitions toward guarded fallback as density, impairment, and complexity increase. These findings position intent-first aerial V2V as a bounded enabler for scaling tactical coordination in disturbance-driven urban airspace.
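
What an "authenticated freshness check" on a received beacon involves can be sketched in a few lines. The sketch assumes a shared-key HMAC as a stand-in for whatever authentication the actual stack uses, and the field names (`sent_at_s`, `intent`) and the 0.5 s age limit are hypothetical.

```python
import hashlib
import hmac
import json

def sign_beacon(key, payload):
    """Attach an HMAC-SHA256 tag computed over a canonical JSON encoding."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"body": payload, "tag": tag}

def accept_beacon(key, beacon, now_s, max_age_s=0.5):
    """Accept only beacons that verify and whose timestamp is recent enough."""
    body = json.dumps(beacon["body"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, beacon["tag"]):
        return False  # authentication failed: content or key mismatch
    age = now_s - beacon["body"]["sent_at_s"]
    return 0.0 <= age <= max_age_s  # reject stale or future-dated beacons

key = b"demo-shared-key"  # hypothetical key for the example only
beacon = sign_beacon(key, {"id": "uav1", "sent_at_s": 10.0, "intent": "yield"})
tampered = {"body": dict(beacon["body"], intent="proceed"), "tag": beacon["tag"]}
```

Checking authenticity before age means a forged timestamp cannot make an old message look fresh, which is one way to read the abstract's claim that the stack rejects invalid tactical messages.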

2605.20592 2026-05-21 cs.LG

ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning

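Sofia R. Miskala-Dinc, Aviva Prins

Affiliations University of Maryland

AI summary The paper studies the efficiency of model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) and proposes ReversedQ, which speeds up learning by changing the value-function update order, update frequency, and initialization. Experiments show it outperforms the RandomizedQ baseline on several tasks.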

Comments This paper contains 5 pages and 2 figures. To be presented at the Adaptive and Learning Agents workshop (ALA 2026) at AAMAS 2026

Abstract

We study model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) with stationary dynamics across episodes. We identify a central issue in nascent model-free posterior-sampling works: the reliance on delayed learning in order to prove theoretical guarantees. In particular, we identify three opportunities for faster learning: (i) value-function update order, (ii) update frequencies, and (iii) value-function initialization. Using Wang et al.'s RandomizedQ as a basis, we illustrate these changes and their individual (as well as cumulative) impact in multiple empirical studies. We find that our combined modifications, termed ReversedQ, improve scaled mean cumulative reward compared to RandomizedQ, from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL), and from 21.76% to 61.81% in a chain MDP.
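
Why update order matters can be seen with plain tabular Q-learning on a single episode. The sketch below is not RandomizedQ or ReversedQ (both involve posterior sampling, which is omitted here); it only assumes that "update order" refers to sweeping an episode's transitions from the last step back to the first.

```python
from collections import defaultdict

def update_from_episode(q, episode, horizon, n_actions, lr=1.0, reverse=True):
    """episode: list of (h, state, action, reward, next_state), one per step h.
    Applies one Q-learning update per transition, in reverse or forward order."""
    steps = reversed(episode) if reverse else episode
    for h, s, a, r, s_next in steps:
        if h + 1 < horizon:
            bootstrap = max(q[(h + 1, s_next, b)] for b in range(n_actions))
        else:
            bootstrap = 0.0  # no value beyond the final step
        q[(h, s, a)] += lr * (r + bootstrap - q[(h, s, a)])
    return q

# A three-step chain with reward only at the end (toy data).
episode = [(0, 0, 0, 0.0, 1), (1, 1, 0, 0.0, 2), (2, 2, 0, 1.0, 3)]
q_reverse = update_from_episode(defaultdict(float), episode, 3, 2, reverse=True)
q_forward = update_from_episode(defaultdict(float), episode, 3, 2, reverse=False)
```

With the reverse sweep, the terminal reward reaches the first step after one episode; with the forward sweep it only reaches the last step, so a chain of length H needs on the order of H episodes to propagate the same information.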

2605.20591 2026-05-21 cs.CL cs.CY

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

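Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood

Affiliations The University of New South Wales, Sydney, Australia

AI summary The paper studies hallucination and actor-level abuse in web-deployed medical large language models. Drawing on 6,233 MedGPTs (with a stratified sample of 1,500 evaluated) and 10 open-source LLMs, it finds that 25-30% of MedGPTs have low factual accuracy, 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures, revealing systemic gaps and underscoring the need for multi-metric evaluation and stronger safeguards.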

Abstract

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

2605.20588 2026-05-21 cs.CL cs.CV

Direct Translation between Sign Languages

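Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang

Affiliations Oregon State University

AI summary The paper proposes direct sign-to-sign translation, using back-translation to generate synthetic sign-sign pairs and thereby avoiding the error propagation and information loss of the traditional cascade approach. It reports higher translation quality and faster inference on several sign language datasets.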

Abstract

The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach, which composes separate sign-to-text, text-to-text, and text-to-sign systems, suffers from error propagation and extra latency as well as the loss of information unique to the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text-to-sign (T2S) and sign-to-sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% higher BLEU-4) while achieving a roughly 2.3x speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.
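
The geometric metric mentioned above, DTW-aligned MPJPE, compares two pose sequences of different lengths by first aligning them with dynamic time warping and then averaging the per-frame mean per-joint position error along the alignment. The sketch below is one common way to compute it; normalizing by alignment path length is an assumption, and the paper's exact variant may differ.

```python
import math

def frame_error(frame_a, frame_b):
    """Mean Euclidean distance across corresponding joints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)

def dtw_mpjpe(pred, ref):
    """pred, ref: lists of frames; each frame is a list of (x, y, z) joints."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    # cost[i][j] = (accumulated error, path length) for aligning the prefixes
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            err = frame_error(pred[i - 1], ref[j - 1])
            cost[i][j] = (prev[0] + err, prev[1] + 1)
    total, steps = cost[n][m]
    return total / steps

# Toy single-joint sequences: `slow` repeats the first frame of `ref`.
ref = [[(0, 0, 0)], [(1, 0, 0)], [(2, 0, 0)]]
slow = [[(0, 0, 0)], [(0, 0, 0)], [(1, 0, 0)], [(2, 0, 0)]]
offset = [[(0, 0, 1)], [(1, 0, 1)], [(2, 0, 1)]]
```

The alignment makes the metric insensitive to signing speed, so only differences in pose (as in `offset`) contribute to the error.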

2605.20584 2026-05-21 cs.CV

QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

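Dishanika Denipitiyage, Aruna Seneviratne, Suranga Seneviratne

Affiliations University of Sydney; University of New South Wales

AI summary The paper proposes QwenSafe, a vision-language model that automatically identifies Apple-defined content rating descriptors (CRDs) by jointly reasoning over app metadata and screenshots. It introduces the metadata2CRD data-construction pipeline and uses Direct Preference Optimization (DPO) to improve prediction accuracy; experiments show QwenSafe clearly outperforms existing models in binary CRD classification.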

Abstract

Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlight the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.
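
For readers unfamiliar with DPO, the standard per-pair objective is short enough to write out. The sketch below implements the usual DPO loss from sequence log-probabilities under the policy and a frozen reference model; the value of `beta` and the example numbers are illustrative, and nothing here reflects QwenSafe's specific training setup.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * [(policy - ref on chosen) - (policy - ref on rejected)])."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Illustrative log-probabilities (not from the paper).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
prefers_chosen = dpo_loss(-1.0, -3.0, -2.0, -2.0)
prefers_rejected = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss falls below log 2 once the policy raises the chosen response relative to the reference more than it raises the rejected one, which is the sense in which DPO "aligns" predictions with preferred explanations.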

2605.20581 2026-05-21 cs.LG cond-mat.mtrl-sci

TriForces: Augmenting Atomistic GNNs for Transferable Representations

Ali Ramlaoui, Alexandre Duval, Hannah Bull, Victor Schmidt, Hugues Talbot, Fragkiskos D. Malliaros, Joseph Musielewicz

Affiliations Université Paris-Saclay, CentraleSupélec, Inria, Gif-sur-Yvette, France

AI summary TriForces separates composition and structure information and combines this with self-supervised learning, improving performance on MatBench and QM9 without DFT labels and enabling efficient similar-structure retrieval, with lower energy and force errors on OMat24 in the limited-data regime.

Comments 28 pages, 11 figures. Accepted at ICML 2026

Abstract

Machine learning interatomic potentials (MLIPs) achieve excellent accuracy when trained on large Density Functional Theory (DFT) data. To be useful in practice, they must often be adapted to target chemistries using small and expensive task-specific datasets. However, MLIPs transfer inconsistently across domains, with representations that often lose accessible composition and structure information. To address this, we present TriForces, a model-agnostic three-stream framework that separates composition and structure information, combined with self-supervised learning to preserve transferable representations. TriForces improves performance on MatBench and QM9 over baselines without needing DFT labels and enables efficient similar structure retrieval through its learned latent space. On OMat24, in the limited-data training regime, TriForces reduces energy MAE by 57% with only 20K samples and improves force MAE across sample sizes. We release pretrained TriForces variants across multiple MLIP architectures with code at https://github.com/Ramlaoui/triforces.

2605.20580 2026-05-21 cs.LG

Deep Learning Surrogates for Emulating Stochastic Climate Tipping Dynamics

Adeline Hillier, Jennifer Sleeman, Jay Brett, Caroline Tang, Jenelle Millison, Anand Gnanadesikan

Affiliations Johns Hopkins Applied Physics Laboratory; Johns Hopkins University

AI summary The paper proposes a dynamics-informed Temporal Fusion Transformer as a data-driven surrogate for computationally intensive Earth system simulations, forecasting the timing of tipping events at far lower computational cost.

Abstract

This work explores a dynamics-informed Temporal Fusion Transformer (TFT) as a data-driven surrogate for computationally intensive Earth system simulations. Focusing on multivariate time series describing global ocean transport, we demonstrate the surrogate's ability to forecast tip events across thousands of time steps. The data involve up to 21 non-stationary time series in addition to static covariates describing free parameters and initial conditions. Modifications to the architecture and objective function yield a surrogate that anticipates the timing of Atlantic and Pacific collapses to high fidelity and captures the stochastic uncertainty in transition timing across ensemble predictions. The learned surrogate achieves a 465x computational speedup over the numerical simulator while maintaining differentiability with respect to parameters and initial conditions.

2605.20577 2026-05-21 cs.AI cs.LG

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, Masashi Sugiyama

Affiliations The University of Tokyo; RIKEN AIP; Nara Institute of Science and Technology; Kobe University; Kyoto University; ATR; The University of Sydney

AI summary The paper presents Mahjax, a Riichi Mahjong environment implemented in JAX that uses GPU acceleration for large-scale parallel rollouts, addressing the game's high-dimensional state space and stochasticity and providing an efficient training platform for reinforcement learning.

Abstract

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

2605.20576 2026-05-21 cs.CV

$Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

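Chia-Hsiang Kao, Cong Phuoc Huynh, Chien-Yi Wang, Noranart Vesdapunt, Stefan Stojanov, Bharath Hariharan, Oleksandr Obiednikov, Ning Zhou

Affiliations Cornell University; Amazon

AI summary The paper proposes $Δ$YNAMICS, a framework that uses language as a unified representation of rigid-body dynamics: it generates scene configurations for physics simulation as structured text and improves generalization with natural language motion reasoning and optical-flow input. On CLEVRER it reaches a segmentation IoU 7x that of existing VLMs, and it transfers well to a new dataset.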

Comments Accepted to CVPR 2026. Project page: https://iandrover.github.io/2026_dynamics

Abstract

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $Δ$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $Δ$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $Δ$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.

2605.20569 2026-05-21 cs.CV

End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

Xu Han, Mohammad Aminul Islam, Lei Wang, Zekun Long, Guanmanyi Fu, Wangshu Cai, Kuldip K. Paliwal, Jun Zhou

Affiliations School of Information and Communication Technology, Griffith University, Australia; School of Engineering and Built Environment, Griffith University, Australia; School of Environment and Science, Griffith University, Australia

AI summary The paper proposes an end-to-end material-aware tracking framework that jointly optimizes material decomposition and target localization, using a weighted target-oriented unmixing loss to align material representations with localization accuracy and so improve hyperspectral tracking robustness under appearance ambiguity, illumination change, and background clutter.

Abstract

Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at https://github.com/han030927/E2EMPT.

2605.20566 2026-05-21 cs.RO

Conflict-Aware Active Perception and Control in 3D Gaussian Splatting Fields via Control Barrier Functions

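Amirhossein Mollaei Khass, Athanasios Cosse, Vivek Pandey, Nader Motee

Affiliations Department of Mechanical Engineering and Mechanics, Lehigh University

AI summary The paper proposes a conflict-aware active perception and control framework based on Control Barrier Functions for navigating safely in 3D Gaussian Splatting environments while gathering information to reduce map uncertainty. A unified safety-critical, perception-aware quadratic program resolves the conflict between safety and perception objectives.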

Comments Project website: https://sircesoc.github.io/Conflict_Aware_Active_Perception/

Abstract

Active perception in uncertain environments requires robots to navigate safely while acquiring informative observations to reduce map uncertainty. These objectives inherently conflict, as informative viewpoints often lie near uncertain regions with higher collision risk. To address this challenge, we develop a conflict-aware active perception and control framework for robotic systems operating in environments represented by 3D Gaussian Splatting (3DGS). Safety is enforced using a Control Barrier Function (CBF) derived from an Average Value-at-Risk (AV@R) collision-risk metric that accounts for geometric uncertainty and guarantees forward invariance of a safe set. To improve perception, we propose a risk-aware Expected Information Gain (EIG) formulation for selecting the next-best-view and introduce perception barrier functions that align the camera orientation with the local information-ascent direction. To obtain a tractable formulation for these conflicting safety and perception objectives, we propose a unified safety-critical, perception-aware quadratic program that enforces safety as a hard constraint while relaxing perception constraints through slack variables. Simulation results demonstrate that the proposed method improves both safety and information acquisition compared to existing 3DGS-based approaches.
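
Average Value-at-Risk summarizes the tail of a loss distribution rather than its mean, which is why it suits collision risk under geometric uncertainty. A minimal sample-based sketch follows; it assumes `alpha` denotes the tail fraction (conventions differ between authors) and is a generic estimator, not the paper's closed-form metric for Gaussian splats.

```python
import math

def average_value_at_risk(losses, alpha):
    """Mean of the worst `alpha` fraction of sampled losses (higher = worse).
    alpha=1.0 recovers the plain mean; smaller alpha focuses on the tail."""
    ordered = sorted(losses, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k

# Toy collision-loss samples (illustrative values only).
samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tail_risk = average_value_at_risk(samples, alpha=0.5)
mean_risk = average_value_at_risk(samples, alpha=1.0)
```

Because the tail average is never smaller than the mean, a barrier function built on it is more conservative than one built on expected collision loss.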

2605.20561 2026-05-21 cs.RO

Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots

容错、保持刚性的可膨胀桁架机器人控制

James Wade, Isaac Weaver, Mihai Stanciu, Nathan Usevitch

Affiliations * Ira A. Fulton School of Engineering, Mechanical Engineering Department, Brigham Young University

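AI summary: This paper presents a fault-tolerant control framework for inflatable robotic trusses that maintains functionality despite motor failures. Its three contributions are a kinematic optimization extended to handle arbitrary combinations of motor failures, discrete-time control barrier function constraints that guarantee structural rigidity, and closed-loop position control using onboard encoder feedback and a forward-kinematics-based state estimator.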

详情
AI中文摘要

等周机器人桁架可以适应不同的任务和环境,因为它们具有高强重比,能够大幅改变自身形状,并可以重新配置成多种不同形状。然而,操作环境中电机故障如果未得到妥善处理,会严重限制操作能力。本文提出了一种容错控制框架,用于可膨胀机器人桁架,能够在电机故障的情况下保持功能,通过三个关键贡献。首先,我们扩展运动学优化以处理任意组合的电机故障,通过施加等式约束确保故障执行器不被使用。其次,我们引入离散时间控制屏障函数(DTCBF)约束,数学上保证结构刚性的同时最大化工作空间利用率,这是在离散时间控制下可靠操作桁架机器人的重要要求。第三,我们利用 onboard 编码器反馈和基于正向运动学的状态估计器实现闭环位置控制,在存在干扰的情况下提高位置精度。我们通过模拟和硬件实验验证了我们的方法,针对一个具有6个执行器的2D等周桁架测试平台。对于具有6个执行器的2D配置,我们展示了在单个电机故障下工作空间保留超过69%,并利用闭环控制实现了跟踪精度的25%提升。这些结果为在退化驱动条件下更鲁棒和坚韧的等周桁架机器人奠定了基础。

英文摘要

Isoperimetric robotic trusses can adapt to different tasks and environments because they have a high strength-to-weight ratio, can change their own shape dramatically, and can be reconfigured into a variety of different shapes. However, motor failures in operational environments can severely limit operational capabilities if not properly addressed. This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, shown through three key contributions. First, we extend the kinematic optimization to handle arbitrary combinations of motor failures by imposing equality constraints to ensure failed actuators are not used. Second, we introduce discrete-time control barrier function (DTCBF) constraints that mathematically guarantee structural rigidity while maximizing workspace utilization, a critical requirement for reliable operation of truss robots under discrete-time control. Third, we implement closed-loop position control using onboard encoder feedback and a forward kinematics-based state estimator, improving positional accuracy in the presence of disturbances. We validate our approach through simulation and hardware experiments on a 2D isoperimetric truss testbed. For a 2D configuration with 6 actuators, we demonstrate >69% workspace preservation under single-motor failures and a >25% improvement in tracking accuracy with closed-loop control. These results establish a foundation for more robust and resilient isoperimetric truss robots operating under degraded actuation.
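For readers unfamiliar with discrete-time control barrier functions, the commonly used condition is $h(x_{k+1}) - h(x_k) \ge -\gamma\, h(x_k)$ with $0 < \gamma \le 1$. The sketch below checks that textbook condition for candidate steps. It is a minimal illustration only: `h_now` and `h_next` stand for a hypothetical scalar rigidity margin, and the function names and the default `gamma` are assumptions, not the authors' implementation.

```python
def dtcbf_satisfied(h_now: float, h_next: float, gamma: float) -> bool:
    """Check the standard discrete-time CBF condition h_next - h_now >= -gamma * h_now."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return h_next - h_now >= -gamma * h_now


def filter_steps(h_now: float, candidate_h_next: list, gamma: float = 0.5) -> list:
    """Keep only candidate next-step margins that satisfy the condition."""
    return [h for h in candidate_h_next if dtcbf_satisfied(h_now, h, gamma)]
```

With `gamma = 0.5`, a step may shrink the margin by at most half per time step, so a positive margin never reaches zero in a single step.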

2605.20555 2026-05-21 cs.LG cs.AI

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

通过logit平均在LLMs后训练中补充强化学习

Xingwei Gan, Ying Zhu

Affiliations * UC San Diego

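AI summary: This paper proposes complementing reinforcement learning with logit averaging in LLM post-training. The method is incorporated into Group Relative Policy Optimization (GRPO) without KL regularization or a critic; the trainable policy and the reference policy are coupled through the logit-averaging structure, leveraging the trainable policy's reasoning ability while retaining the formatting advantage of SFT.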

详情
AI中文摘要

我们介绍了一种新颖的方法,该方法对冻结的参考策略(例如SFT)和可训练策略的logits进行平均,并将该方法整合到Group Relative Policy Optimization (GRPO)中。与Reinforcement Learning with Verifiable Rewards (RLVR)方法不同,我们的方法不涉及Kullback Leibler (KL)正则化或critic;可训练策略和参考锚点通过logit平均结构耦合,以利用可训练策略的推理能力,同时保持SFT的格式优势。我们的方法在MATH、cn-k12和MMLU上进行了评估,结果表明其准确率高于或至少与传统的KL正则化GRPO相当。

英文摘要

We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve Kullback-Leibler (KL) regularization or a critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show higher or at least comparable accuracy relative to the canonical KL-regularized GRPO.
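The core operation, averaging the logits of a frozen reference policy and a trainable policy before normalizing, can be sketched in a few lines. This is a minimal illustration under assumptions: the abstract does not give the exact mixing rule, so the `weight` parameter (0.5 means a plain average) and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_policy(ref_logits, train_logits, weight=0.5):
    """Mix frozen reference logits with trainable logits, then normalize.

    weight is the share given to the trainable policy (assumed parameter).
    """
    if len(ref_logits) != len(train_logits):
        raise ValueError("both policies must score the same vocabulary")
    mixed = [(1.0 - weight) * r + weight * t
             for r, t in zip(ref_logits, train_logits)]
    return softmax(mixed)
```

Because the reference logits enter every forward pass, the sampled distribution stays anchored to the reference without a separate KL penalty term in the loss.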

2605.20554 2026-05-21 cs.AI cs.HC cs.SI

Personality Engineering with AI Agents: A New Methodology for Negotiation Research

利用AI代理的人格工程:谈判研究的新方法论

Michelle A. Vaccaro, Jared R. Curhan

Affiliations * MIT Institute for Data, Systems, and Society; MIT Sloan School of Management

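AI summary: This paper proposes a methodology that uses AI agents to parameterize, manipulate, and evaluate negotiator personality, using the two core dimensions of the interpersonal circumplex, warmth and dominance, as a coordinate system. It offers a new way to rigorously test negotiation theory and to design the personalities of AI negotiation agents.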

详情
AI中文摘要

根据经典谈判理论,人们在谈判中的成功取决于他们平衡竞争需求的能力--共情与主张,表现出对他人的关心和对自己的关心,对人温和而对问题强硬。然而,人们难以管理这些张力,因此研究人员缺乏在受控条件下严格测试该领域规定的能力。AI代理没有相同的限制,其精确性、 repertoire、一致性以及可扩展性使能够贡献于谈判理论的新一类实验成为可能。在本文中,我们介绍了一种称为人格工程的方法论,该方法利用AI代理来精确参数化、操纵和评估谈判者的人格。我们提议使用人际圆周--以及其两个核心维度温暖和支配--作为该领域的基础坐标系统。这种方法不仅提供了一种严格测试经典谈判理论的方法,还为设计AI谈判代理的人格提供了一种实用指南。

英文摘要

According to canonical negotiation theory, people's success in a negotiation depends on how well they balance competing demands--empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field's prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex--and its two core dimensions of warmth and dominance--as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.

2605.20551 2026-05-21 cs.CV cs.AI cs.RO

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

更快或更强:通过加权聚合和标记剪枝实现灵活的视觉位置识别

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

Affiliations * University College London; Karlsruhe Institute of Technology; Hunan University; Shenzhen University

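AI summary: This paper proposes the Weighted Aggregated Descriptor (WeiAD) and the token pruning framework WeiToP to improve the accuracy and efficiency of visual place recognition, with a flexibly adjustable trade-off between accuracy and feature-extraction cost.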

详情
AI中文摘要

视觉位置识别(VPR)旨在将查询图像匹配到大规模数据库中相同地点的参考图像。最近最先进的方法采用视觉Transformer(ViTs)作为基础模型,提取对视角、光照和季节变化具有鲁棒性的补丁级特征,然后聚合为紧凑的全局描述符进行检索。大多数现有聚合方法将补丁标记均匀地池化到学习的簇中,尽管不同簇往往编码不同的空间或语义模式,并对VPR性能贡献不均。为了解决这一限制,我们提出了加权聚合描述符(WeiAD),在聚合过程中分配簇的权重,产生更具判别性的全局表示。除了准确性之外,检索延迟是大规模部署和资源受限边缘设备的关键关注点。先前的工作主要通过压缩全局描述符来减少延迟,而忽略了特征提取的成本,这在基于ViT的基础模型中变得更加严重。因此,我们引入了面向VPR的标记剪枝框架WeiToP,通过自蒸馏减少特征提取成本,其中聚合诱导的标记重要性监督一个轻量级剪枝模块,附加到早期Transformer层上,使推理时能够进行标记剪枝。在单次联合训练阶段后,WeiToP能够在推理时实现插拔式的标记剪枝,允许在不额外训练的情况下灵活地控制精度-效率权衡。此外,WeiToP在现有针对通用视觉任务的标记剪枝方法上表现更优。

英文摘要

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.
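The idea of weighting clusters during aggregation can be illustrated with a toy descriptor builder. This is a sketch under assumptions, not WeiAD itself: it takes already-pooled per-cluster features, scales each by a cluster weight, concatenates them, and L2-normalizes the result. How the weights are learned is not covered here, and all names are hypothetical.

```python
import math


def weighted_aggregate(cluster_feats, cluster_weights):
    """Scale each cluster's pooled feature by its weight, concatenate, L2-normalize."""
    if len(cluster_feats) != len(cluster_weights):
        raise ValueError("need exactly one weight per cluster")
    flat = [w * v
            for feat, w in zip(cluster_feats, cluster_weights)
            for v in feat]
    norm = math.sqrt(sum(v * v for v in flat))
    if norm == 0.0:
        return flat  # all-zero descriptor: nothing to normalize
    return [v / norm for v in flat]
```

Uniform pooling corresponds to all weights being equal; unequal weights let more discriminative clusters dominate the global descriptor.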

2605.20549 2026-05-21 cs.CV

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

MAPS:用于在受控3D场景空间中探测视觉模型的合成数据集

Santiago Galella, Pamela Osuna-Vargas, Maren Wehrheim, Martina G. Vilas, Gemma Roig, Matthias Kaschube

Affiliations * FIAS & Institute of Computer Science, Goethe University Frankfurt; Mila & Department of Biology, York University; Institute of Computer Science, Goethe University Frankfurt

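AI summary: This paper presents the MAPS dataset for studying vision model behavior in a controlled 3D scene space. A regression-based sensitivity analysis of 20 models' reliance on scene factors finds that camera distance and elevation are the main drivers of recognition failure, and that modern CNNs and transformers show similar sensitivity profiles.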

Comments 33 pages, 20 figures

详情
AI中文摘要

现代视觉模型在标准基准上表现强劲,但其整体准确率难以揭示驱动预测的场景属性。现有鲁棒性基准提供重要压力测试,但通常操纵全局2D图像属性,依赖现实世界变化或仅覆盖有限的3D对象和场景参数。我们引入MAPS(Manifolds of Artificial Parametric Scenes),一种可扩展的工具,用于受控地将视觉模型行为归因于场景参数。MAPS包含2,618个经过筛选的逼真3D网格,已验证在560个ImageNet类别上具有可识别性,并提供基于Blender的渲染管道,可按需生成图像,连续变化九个独立场景因素,涵盖背景、相机和照明,可扩展至其他因素。为了展示其适用性,我们使用MAPS评估20种卷积和Transformer模型,通过基于回归的敏感性分析量化其对这些场景因素的依赖性。我们发现所有测试架构中普遍存在一个几乎普遍的失败轴:相机距离和高度在识别失败中始终占主导地位,无论ImageNet准确性如何。然而,完整的敏感性结构揭示出现代CNN和Transformer模型聚集在一起,与旧架构不同,表明细粒度的架构设计选择,而非粗粒度的CNN与Transformer区别,是敏感性特征的更强决定因素。

英文摘要

Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

2605.20547 2026-05-21 cs.LG cs.AI stat.ML

Latent Process Generator Matching

潜在过程生成器匹配

Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell

Affiliations * Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet

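AI summary: This paper proposes latent process generator matching, a framework that treats the observed generative state as a deterministic image of a tractable Markov process, extending Generator Matching theory to time-dependent latent conditional processes.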

Comments 18 pages, 1 figure

详情
AI中文摘要

许多近期的流匹配和扩散式生成模型在训练过程中依赖于辅助的随机动力学:通过模拟更丰富的过程来定义条件目标,但辅助状态在生成时要么难以采样,要么并不属于期望的输出。现有的生成器匹配理论规范了对静态潜在随机变量的条件,而几篇近期论文证明了特定增强状态构造的投影结果的特殊情况。我们引入了潜在过程生成器匹配,一种通用框架,将观测到的生成状态视为可 tractable 马尔可夫过程的确定性图像 $X_t=Φ(Y_t)$。我们显示在这一设定下,可以在图像空间中学习一个随机过程的生成器,其一阶边缘分布与投影过程相同。这扩展并涵盖了文献中的离散潜在过程结果,并将生成器匹配从静态潜在变量扩展到丰富的时间依赖潜在条件过程家族。

英文摘要

Many recent flow-matching and diffusion-style generative models rely on auxiliary stochastic dynamics during training: a richer process is simulated to define conditional targets, but the auxiliary state is either intractable to sample at generation time or simply not part of the desired output. Existing Generator Matching theory formalises conditioning on static latent random variables, and several recent papers prove special cases of projection results for particular augmented-state constructions. We introduce latent process generator matching, a general framework that treats the observed generative state as a deterministic image $X_t=\Phi(Y_t)$ of a tractable Markov process $Y_t$. We show that in this setting one may learn the generator of a stochastic process on the image space which has the same one-time marginal distributions as the projected process. This generalizes and subsumes the discrete latent process results from the literature, and extends Generator Matching from static latent variables to a rich family of time-dependent latent conditional processes.

2605.20544 2026-05-21 cs.RO cs.CV

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

顺从综合征:具身机器人代理中的退避基准测试

Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik

Affiliations * Purdue University; Bilkent University

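AI summary: This paper presents RoboAbstention, a framework for benchmarking abstention in embodied robotic agents. It generates abstention instructions grounded in images from five robotics datasets, evaluates several frontier VLMs on abstention, and explores methods for improving abstention performance.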

详情
AI中文摘要

视觉语言模型(VLMs)被用作具身代理的高层规划器,将自然语言指令和视觉观察转化为行动计划。尽管先前的工作研究了LLMs中的退避行为,但现有的基准测试大多仅限于文本,无法捕捉到具身机器人环境中的感知基础和物理约束。在这样的环境中,退避需要识别指令模糊、物理不可行、基于错误前提或在给定可用感觉模态和上下文下无法解决的情况。为了解决这一差距,我们引入了一个分类法来分类具身机器人中的退避行为,并提出了RoboAbstention,一个可扩展且可审计的框架,用于生成基于五个机器人数据集收集的图像的退避指令。RoboAbstention通过三个阶段的流程实现该分类法:(1)结构化的视觉基础,(2)确定性的约束推导,(3)通过类别特定模板进行受控的指令生成。这使能够构建一个具有可验证退避条件的多样化数据集。我们评估了几种前沿VLMs,并发现所有模型在退避任务中都表现出显著的弱点,包括那些具有高级推理能力的模型。表现最好的模型Gemini 2.5 Flash仅在6,069个基准指令中退避39.0%,而具身规划器Gemini Robotics ER 1.6 Preview仅在16.5%的指令中退避。我们进一步探讨了改进VLM规划器退避性能的方法,如防御性提示和上下文学习,并发现这些干预措施显著提高了性能,达到Gemini Robotics ER 1.6 Preview的93.6%退避率和GPT 5.4 Mini的88.6%退避率,但没有任何方法完全解决了该问题。我们开源了RoboAbstention在https://purseclab.github.io/RoboAbstention/。

英文摘要

Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.
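To give the reported baseline rates a sense of scale, the short calculation below converts them into approximate instruction counts. It assumes both rates are measured over the full set of 6,069 benchmark instructions, as the abstract states for the baseline results; since the published rates are rounded, the counts are approximations. The helper name is hypothetical.

```python
def abstained_count(rate_percent: float, total: int) -> int:
    """Approximate number of abstentions implied by a rounded percentage rate."""
    return round(rate_percent / 100 * total)


TOTAL_INSTRUCTIONS = 6069  # benchmark size reported in the abstract

baseline_rates = {
    "Gemini 2.5 Flash": 39.0,
    "Gemini Robotics ER 1.6 Preview": 16.5,
}

for model, rate in baseline_rates.items():
    count = abstained_count(rate, TOTAL_INSTRUCTIONS)
    print(f"{model}: about {count} of {TOTAL_INSTRUCTIONS} instructions")
```

In other words, even the best baseline abstains on roughly 2,367 instructions and proceeds on the remaining 3,700 or so, all of which were constructed to warrant abstention.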

2605.20543 2026-05-21 cs.CV

Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

不确定性引导的保守传播用于血管分割的结构推理

Huan Huang, Michele Esposito, Chen Zhao

Affiliations * Department of Computer Science, Kennesaw State University; Department of Cardiology, Medical University of South Carolina

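AI summary: This paper proposes Uncertainty-Guided Conservative Propagation (UGCP), a module that refines vessel segmentation through a small number of logit-space update steps driven by local prediction interactions. It improves the Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance while reducing vessel disconnections and improving structural consistency.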

Comments Pattern Recognition submission. 35 pages, 6 figures

详情
AI中文摘要

准确的血管分割对于医学图像分析至关重要,但仍然具有挑战性,因为复杂的血管模式和成像模糊性导致了困难。大多数深度模型依赖于单次预测,限制了它们在推理过程中细化不确定或断开区域的能力。为了解决这一限制,我们提出了不确定性引导的保守传播(UGCP),这是一种通用的插件模块用于血管分割。与其直接使用一次输出作为最终预测不同,UGCP通过局部预测交互进行少量logit空间更新步骤来改进分割。预测不确定性引导可靠区域以支持模糊区域,同时结构意识调制和源基于稳定化减少不可靠传播和过度漂移。该模块是可微的,可以与不同的分割网络端到端训练。我们在四个公开的血管分割数据集上评估了UGCP,涵盖2D和3D任务,包括视网膜血管、冠状动脉和脑血管分割。使用基于卷积神经网络和Transformer的后端进行的实验显示,Dice相似系数、中心线Dice和95百分位Hausdorff距离均有所提高。进一步分析表明,UGCP在有限的额外计算下减少了血管断开并提高了结构一致性。代码将在https://github.com/chenzhao2023/UGC_PR上提供。

英文摘要

Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local prediction interactions. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at https://github.com/chenzhao2023/UGC_PR.

2605.20052 2026-05-21 cs.CL cs.AI

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

PromptRad: 基于知识的多标签提示微调用于低资源放射报告标注

Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

Affiliations * Department of Artificial Intelligence and AI Research Center, Chang Gung University; Department of Radiology, Sijhih Cathay General Hospital; Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital; Department of Trauma and Emergency Surgery, Chang Gung Memorial Hospital; Department of Computer Science, National Tsing Hua University

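AI summary: This paper proposes PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling in low-resource settings. It enriches category representations with synonyms from the UMLS Metathesaurus and outperforms conventional methods with far less labeled data.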

Comments BioNLP 2026 @ ACL (camera-ready version)

详情
AI中文摘要

自动报告标注有助于从非结构化文本中识别临床发现,并为医学影像研究提供大规模注释。现有的基于规则的标注器难以处理临床报告中的多样化描述,而微调预训练语言模型(PLMs)需要大量标注数据,这些数据在临床环境中通常不可用。在本文中,我们提出PromptRad,一种基于知识的多标签提示微调方法,用于在低资源环境下进行放射报告标注。PromptRad将多标签分类重新表述为掩码语言建模,并将UMLS元词典中的同义词纳入多词提示器以丰富类别表示。通过微调PLM而不增加额外分类层,PromptRad所需的标注数据比传统微调要少得多。在肝CT报告上的实验表明,PromptRad在仅使用32个标注训练示例的情况下,优于基于词典和微调的基线方法,并且在使用远小模型的情况下,性能与GPT-4具有竞争力。进一步分析显示,PromptRad比现有方法更有效地捕捉复杂的否定模式,使其成为数据稀缺临床场景中报告标注的有希望的解决方案。我们的代码可在https://github.com/ila-lab/PromptRad上获得。

英文摘要

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.

2605.20030 2026-05-21 cs.LG math.OC

Take It or Leave It: Intent-Controlled Partial Optimal Transport


Salil Parth Tripathi, Bertrand Chapron, Fabrice Collard, Nicolas Courty, Ronan Fablet

Affiliations * OceanDataLab; Ifremer; Université Bretagne Sud; IMT Atlantique

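AI summary: This paper proposes intent-controlled partial optimal transport (IC-POT), which replaces the global rejection mechanism with pointwise rejection costs for applications that need more structured, pointwise rejection, and demonstrates its practical value in positive-unlabeled learning and open-partial domain adaptation.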

详情
AI中文摘要

虽然最优传输(OT)通过要求两个测度精确匹配来施加刚性约束,而部分最优传输通过允许通过全局预算、标量退款或统一拒绝规则来保留未匹配的质量。然而,许多应用需要更结构化的点wise拒绝机制,其中决定是否未匹配质量取决于侧面特定的可靠性、支持几何或外部信息,关于哪些组件应参与比较。我们引入了意图控制的部分最优传输(IC-POT),即部分传输的一种有针对性的扩展,它用两个测度上的点wise拒绝成本替代了全局拒绝范式。我们证明了由此产生的优化问题可以以局部接受阈值的形式进行双解释,并可以通过将其重新表述为在扩展支持上的平衡Kantorovich OT问题来求解。除了理论分析外,我们还展示了IC-POT在拒绝由侧面信息驱动的设置中的实际相关性。在正样本无标签学习和开放部分领域适应中,将编码统计结构的点wise拒绝规则纳入固定基线流程中可以提高性能。最后,我们用一个地球物理实际案例来说明IC-POT的使用:多模态卫星海洋测量,其中物理和传感器先验自然地指导拒绝机制并定义检索的可比信号信息。

英文摘要

While optimal transport (OT) enforces a rigid constraint by requiring two measures to be matched exactly, partial optimal transport relaxes this requirement by allowing mass to remain unmatched through a global budget, scalar rebate, or uniform rejection rule. However, many applications call for more structured, pointwise rejection mechanisms, where the decision to leave mass unmatched depends on side-specific reliability, support geometry, or external information about which components should participate in the comparison. We introduce intent-controlled partial optimal transport (IC-POT), a targeted generalization of partial transport that replaces the global rejection paradigm with pointwise rejection costs over both measures. We show that the resulting optimization problem admits a dual interpretation in terms of local acceptance thresholds and can be solved by recasting it as a balanced Kantorovich OT problem on an augmented support. Beyond theoretical analysis, we demonstrate the practical relevance of IC-POT in settings where rejection is driven by side information. In positive-unlabeled learning and open-partial domain adaptation, incorporating pointwise rejection rules that encode statistical structure improves fixed baseline pipelines. Finally, we motivate the use of IC-POT with a practical geophysical case: multi-modal satellite ocean measurements, for which physical and sensor priors naturally inform the rejection mechanism and define the retrieved comparable signal information.
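One way to picture "recasting as a balanced OT problem on an augmented support" is the familiar dummy-point trick, extended so that each point carries its own rejection cost. The sketch below builds such an augmented problem. It is a plausible construction assumed for illustration, and the paper's actual augmentation may differ. `reject_src[i]` and `reject_tgt[j]` are hypothetical per-point costs of leaving mass unmatched.

```python
def augment_problem(cost, a, b, reject_src, reject_tgt):
    """Build a balanced OT problem with one dummy source and one dummy target.

    cost[i][j]    : cost of moving mass from source i to target j
    a, b          : source and target masses (totals may differ)
    reject_src[i] : cost of sending source i's mass to the dummy target
    reject_tgt[j] : cost of filling target j from the dummy source
    """
    n, m = len(a), len(b)
    if len(cost) != n or any(len(row) != m for row in cost):
        raise ValueError("cost must be an n x m matrix")
    if len(reject_src) != n or len(reject_tgt) != m:
        raise ValueError("need one rejection cost per point")
    # Extra column: each source may reject its mass at its own price.
    aug_cost = [list(cost[i]) + [reject_src[i]] for i in range(n)]
    # Extra row: the dummy source covers targets; dummy-to-dummy is free.
    aug_cost.append(list(reject_tgt) + [0.0])
    # Dummy masses make both sides sum to sum(a) + sum(b).
    aug_a = list(a) + [sum(b)]
    aug_b = list(b) + [sum(a)]
    return aug_cost, aug_a, aug_b
```

The augmented problem can be handed to any balanced OT solver. In this construction, mass tends to move from `i` to `j` only when `cost[i][j]` is below `reject_src[i] + reject_tgt[j]`, which matches the intuition of a local acceptance threshold.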

2605.19776 2026-05-21 cs.CV

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

偏好顺序、评分锚定:从融合专家审美真实数据到自我蒸馏

Yuanpei Zhao, Jie Lin, Chao Zhang, Yilin Wang, Mao Li, Chenhui Li, Jie Hou, Tangjie Lv

Affiliations * Sichuan University; NetEase Fuxi AI Lab; East China Normal University

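AI summary: This paper presents the PPaint benchmark, which fuses expert preference and rating data for image aesthetic assessment, and a self-distillation method that yields a single-pass aesthetic scorer. The distilled model outperforms all open-source baselines and comes within 0.04 SRCC of the closed-source Gemini-3.1-Pro.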

Comments 27 pages, 7 pages

详情
AI中文摘要

成对偏好和点状评分是图像审美评估(IAA)的两种主要标注协议,但现有基准仅采用其中一种,未能在受控条件下测量其互补性。我们引入PPaint,一种匹配双协议基准,在五个审美维度上,15名领域专家(每类5名)对150幅中国画进行双协议标注,通过本地密集偏好设计收集45,900个成对专家判断,同时匹配评分。匹配设计揭示了互补优势:偏好产生更一致的顺序排名,而评分锚定了绝对分数尺度。通过两种独立的偏好到评分方法融合两种信号,得到融合的专家真实数据,使两种构造收敛到几乎相同的分数。同样的偏好到评分原则也适用于无标签VLM训练。PSDistill通过Elo参考池将VLM的成对判断转换为校准的伪分数,并通过置信度加权排名优化训练相同的VLM,生成单次推理的审美评分器。在单个绘画类别上训练,蒸馏后的Qwen3-VL-8B在所有三个类别上将均值SRCC从0.504提升到0.709,优于所有开源基线,包括专用审美模型ArtiMuse,并在单次推理成本下与闭源Gemini-3.1-Pro相差0.04 SRCC,跨领域转移在APDDv2上进一步验证。我们将发布完整的PPaint数据集和训练代码。

英文摘要

Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
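The preference-to-score step relies on Elo ratings, which turn a stream of pairwise outcomes into scalar scores. The sketch below implements the standard Elo update only. The `k` factor of 32, the 400-point scale, and the starting rating of 1500 are conventional defaults assumed here, and PSDistill's reference pool and calibration are not reproduced.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one pairwise judgment."""
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def rate_items(judgments, items, start: float = 1500.0):
    """Turn (winner, loser) judgments into one Elo score per item."""
    ratings = {name: start for name in items}
    for winner, loser in judgments:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], True)
    return ratings
```

For example, `rate_items([("a", "b"), ("a", "c")], ["a", "b", "c"])` ranks `a` above both `b` and `c`; the resulting scores could then serve as pseudo-labels for a pointwise scorer.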

2605.19649 2026-05-21 cs.CV

CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

无需CAD的基于NeRF的航天器姿态估计器学习方法

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

Affiliations * Department of Electrical Engineering (ELEN), ICTEAM, UCLouvain; Department of Electrical Engineering (ESAT), KU Leuven; Department of Mechanical Engineering (MECH), KU Leuven; Aerospacelab

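AI summary: This paper proposes a NeRF-based image augmentation method that frees the training of spacecraft pose estimators from large sets of CAD-rendered images: a few tens to a few hundreds of images suffice to train an accurate pose estimator, with improved robustness to real on-orbit conditions.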

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

航天器姿态估计网络需要数万张CAD渲染图像进行训练。这种对合成CAD数据的依赖(i)限制了其在具有可靠几何先验的目标上的应用,排除了不合作或文档不全的航天器,(ii)由于不现实的光照和材料外观导致对真实轨道条件的泛化能力差。本文介绍了一种基于NeRF的图像增强方法,使学习航天器姿态估计器仅需几十到几百张图像。该方法通过几何一致的视角和外观增强生成大量多样化的数据集。这个增强的数据集使无需CAD模型或大规模合成数据集即可训练出准确的目标特定姿态估计器。实验表明,我们的方法支持仅用25到400张真实图像训练出准确的姿态估计器,即使在严重的光照变化下也是如此。当应用于大型CAD基于的合成数据集时,基于NeRF的增强也增强了域外泛化能力,提高了对真实轨道条件的鲁棒性。

英文摘要

Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with a reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied to large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

2605.19624 2026-05-21 cs.CV cs.AI

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

面向组件的结构保持风格迁移用于卫星视觉Sim2Real数据构建

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

Affiliations * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology

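AI summary: This paper proposes a component-aware, structure-preserving style transfer framework for synthetic-to-real data construction in satellite vision. It extracts part-wise style codes from real images and injects them into synthetic images, improving annotation-preserving Sim2Real data generation.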

详情
AI中文摘要

对于基于相机的卫星视觉感知,Sim2Real数据构建需要图像接近真实域传感器外观同时保留来自模拟的注释。具有可靠姿态标签和组件级遮罩的卫星目标的真实传感器图像难以大规模获取,而合成渲染提供精确的几何注释但存在明显的外观差距。本文提出了一种面向组件的结构保持风格迁移框架用于卫星视觉的合成到真实数据构建。该方法通过校准的真实获取、基于ArUbo的相机姿态测量、CAD渲染和组件遮罩构建弱配对的真实-合成样本。然后从未标记的真实图像中提取部件级真实域风格代码,并通过遮罩对齐调节将其注入到对应的合成卫星区域中。为了保持生成图像对下游传感器数据监督的可用性,对抗训练与局部对比一致性、自正则化和边缘保持约束相结合。实验在5000张渲染的卫星图像和100张在校准设置下拍摄的真实图像上进行。真实图像提供目标域外观参考和最终评估图像,而下游的GDRNet姿态估计器仅在合成或翻译的合成图像上进行训练。与代表性图像翻译基线相比,所提方法实现了最小的图像分布差异,FID为54.32,KID为0.048。当翻译数据用于在目标域适应设置下训练GDRNet时,ADD通过率提高到0.260,AUC提高到0.611。这些结果表明,组件级外观迁移可以提高标注保持的卫星视觉Sim2Real数据生成效果。

英文摘要

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.

2605.19537 2026-05-21 cs.LG

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

沉默的超参数:量化推理后端对LLM可重复性的影响

David Pape, Jonathan Evertz, Lea Schönherr

Affiliations * CISPA Helmholtz Center for Information Security

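AI summary: This paper studies how inference backends affect LLM benchmark results, finding that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement, which makes the inference backend a consequential hyperparameter.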

详情
AI中文摘要

在LLM的进步中,标准化基准测试已成为衡量进展的主要方式,其中最先进的改进通常仅以小数点后几位百分比点来区分。同时,现代LLM评估的计算成本推动了专用推理后端的广泛应用,这些软件系统在推理时高效执行训练好的模型。尽管对可扩展性至关重要,系统级优化,如定制CUDA内核和降低精度的算术,可能会改变令牌概率并引入非确定性,这可能引发生成结果的分歧。在本工作中,我们首先调查了推理景观,识别出200个不同的引擎,并分析了35,000篇机器学习论文,发现尽管存在广泛多样性,特定的推理堆栈很少被报告。然后,我们系统地研究了推理后端如何影响LLM基准测试结果。在保持模型权重、解码参数和硬件不变的情况下,我们评估了五个广泛使用的推理引擎,包括vLLM、SGLang和llama.cpp,跨多个开放权重模型和已建立的基准测试。我们证明,仅选择后端即可使基准分数变化高达16.6个百分点,并引发高比例的输出分歧。通过隔离后端优化并追踪执行管道,我们发现这种分歧是由系统级优化如前缀缓存和CUDA图、定制内核以及日志处理中的引擎特定默认设置所驱动。我们的发现将推理后端识别为在LLM评估中之前未报告但重要的超参数,并倡导标准化报告推理堆栈以提高基准比较的可重复性和可解释性。

English abstract

Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama.cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
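The two quantities reported in this abstract, a benchmark score shift in percentage points and an output disagreement rate, can be illustrated with a minimal sketch. It assumes exact-match scoring over per-prompt outputs collected from two backends. The function names `disagreement_rate` and `score_shift_pp` are hypothetical, and this is not the authors' evaluation code.

```python
from typing import Sequence


def disagreement_rate(outputs_a: Sequence[str], outputs_b: Sequence[str]) -> float:
    """Fraction of prompts on which two backends produced different outputs."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both backends must be evaluated on the same prompts")
    if not outputs_a:
        raise ValueError("need at least one prompt")
    differing = sum(a != b for a, b in zip(outputs_a, outputs_b))
    return differing / len(outputs_a)


def accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy (an assumed scoring rule, chosen for illustration)."""
    if len(outputs) != len(references) or not outputs:
        raise ValueError("outputs and references must be non-empty and aligned")
    return sum(o == r for o, r in zip(outputs, references)) / len(outputs)


def score_shift_pp(outputs_a: Sequence[str], outputs_b: Sequence[str],
                   references: Sequence[str]) -> float:
    """Accuracy of backend A minus backend B, in percentage points."""
    return 100.0 * (accuracy(outputs_a, references) - accuracy(outputs_b, references))


# Toy data: same model, prompts, and decoding settings; only the backend differs.
references = ["A", "B", "C", "D"]
backend_a = ["A", "B", "C", "D"]
backend_b = ["A", "B", "D", "D"]

print(disagreement_rate(backend_a, backend_b))
print(score_shift_pp(backend_a, backend_b, references))
```

Reporting both numbers is useful because two backends can disagree on many individual outputs while their aggregate scores stay close.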

2605.19503 2026-05-21 cs.RO cs.AI cs.LG

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

ARC-RL: 一种受ARC Raiders启发的强化学习游乐场

Carlo Romeo, Andrew D. Bagdanov

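Affiliations * Media Integration and Communication Center – University of Florence

AI summary This paper presents ARC-RL, a reinforcement learning playground of four MuJoCo continuous-control environments whose robot morphologies are inspired by the bestiary of ARC Raiders. Using a unified observation template, action convention, and reward function, it studies how reinforcement learning algorithms perform under different morphologies and animation-style constraints.

Details
AI Chinese abstract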

腿式运动的强化学习已经发展成由多组件奖励函数和物理引擎基准构成的一套体系,其形态均源自现实中的商用硬件。然而,游戏NPC受到sim-to-real机器人学中不存在的风格约束,并且常常以没有现实机器人对应物的生物形态出现。我们介绍了ARC-RL,一个包含四个MuJoCo连续控制环境的套件,其机器人形态受ARC Raiders的生物图鉴启发:18自由度的高大六足Queen、12自由度的装甲六足Bastion、18自由度的紧凑六足Tick以及12自由度的四足Leaper。这四个机器人共享统一的观测模板、动作约定、仿真节奏和一个单一的闭式多组件奖励函数,各形态之间的唯一差异仅体现在一小组权重和参数中。奖励融合了速度跟踪帐篷函数、健康存活奖励、相位锁定的步态符合性奖励/代价对、动作正则项、三个安全惩罚和一个姿态锚;奖励中自始至终不引入任何运动捕捉数据。我们还为每种形态提供手工设计的中枢模式发生器(Central Pattern Generator)演示器,它们既作为固定的专家参考,也作为离线到在线训练的先验数据来源。在此游乐场中,我们进行了一项受控的实证研究,比较标准在线算法(SAC、SPEQ、SOPE-EO)和借助先验数据增强的方法(SACfD、SPEQ-O2O、SOPE),并刻画每种范式如何应对游乐场的形态多样性和动画风格约束。源代码见 https://github.com/CarloRomeo427/ARC_RL.git 。

English abstract

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.
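As a rough illustration of the reward structure described in this abstract, the sketch below combines a tent-shaped velocity-tracking term with a weighted sum of components, where only the weights differ per morphology. The tent formula, the component names, and the weight values are all assumptions made for illustration. The abstract does not give the actual closed form, and this is not the ARC-RL implementation.

```python
from typing import Mapping


def tent(value: float, target: float, half_width: float) -> float:
    """Triangular 'tent' shaping: 1 at the target, falling linearly to 0 at
    target +/- half_width (an assumed form, not taken from the paper)."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return max(0.0, 1.0 - abs(value - target) / half_width)


def total_reward(components: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of reward components. In this sketch the weights are the
    only thing that changes between morphologies."""
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical per-morphology weights; names and numbers are illustrative only.
leaper_weights = {"velocity": 2.0, "alive": 1.0, "action_cost": -0.5}

components = {
    "velocity": tent(value=0.75, target=1.0, half_width=0.5),  # half-way down the tent
    "alive": 1.0,
    "action_cost": 0.25,
}

print(components["velocity"])
print(total_reward(components, leaper_weights))
```

Keeping a single reward function and varying only a weight table makes cross-morphology comparisons easier to interpret, which matches the motivation stated in the abstract.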

2605.19376 2026-05-21 cs.AI

Generative Recursive Reasoning

生成性递归推理

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

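Affiliations * KAIST, Mila – Québec AI Institute, New York University, Université de Montréal

AI summary This paper proposes the GRAM framework, which turns recursive latent reasoning into probabilistic multi-trajectory computation. This addresses the determinism of existing recursive reasoning models and supports both conditional reasoning and unconditional generation.

Details
AI Chinese abstract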

未来的神经推理系统应如何实现扩展计算?递归推理模型(RRMs)通过使用共享转移函数进行迭代的潜在状态细化,为自回归序列扩展提供了一种有前途的替代方法。然而,现有RRMs大多是确定性的,遵循单一的潜在轨迹并收敛到单一预测。我们引入生成性递归推理模型(GRAM),一种将递归潜在推理转化为概率多轨迹计算的框架。GRAM将推理建模为随机的潜在轨迹,从而支持多个假设、不同的求解策略,以及通过递归深度和并行轨迹采样实现的推理时扩展。这产生了一个潜在变量生成模型,它支持通过p_θ(y|x)进行条件推理,并在输入固定或缺失时通过p_θ(x)进行无条件生成。通过摊销变分推断训练,GRAM在结构化推理和多解约束满足任务上优于确定性循环和递归基线,同时展示了无条件生成能力。

English abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. Project page: https://ahn-ml.github.io/gram-website
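One way to read the latent-variable formulation in this abstract is the generic sketch below. The first-order factorization over a latent trajectory $z_{1:T}$ (with a fixed initial state $z_0$), the readout from the final state $z_T$, and the amortized posterior $q_\phi$ are assumptions introduced here for illustration. The abstract does not specify the model at this level of detail, so this should not be taken as the paper's exact objective.

```latex
% Assumed first-order factorization of the stochastic latent trajectory z_{1:T}
p_\theta(z_{1:T} \mid x) = \prod_{t=1}^{T} p_\theta(z_t \mid z_{t-1}, x)

% Conditional reasoning: marginalize over latent trajectories
p_\theta(y \mid x) = \int p_\theta(y \mid z_T, x) \, p_\theta(z_{1:T} \mid x) \, dz_{1:T}

% Standard evidence lower bound with an amortized posterior q_\phi (notation assumed)
\log p_\theta(y \mid x) \ge
  \mathbb{E}_{q_\phi(z_{1:T} \mid x, y)} \big[ \log p_\theta(y \mid z_T, x) \big]
  - \mathrm{KL}\big( q_\phi(z_{1:T} \mid x, y) \,\|\, p_\theta(z_{1:T} \mid x) \big)
```

Under this reading, sampling several trajectories from $p_\theta(z_{1:T} \mid x)$ at inference time yields multiple candidate predictions, and increasing $T$ corresponds to scaling recursive depth.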