arXivDaily: arXiv daily academic digest, updated Monday through Friday
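2606.17354 2026-06-17 cs.CL cs.AI New submission

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

Jacob Bremerman, Brihi Joshi, Hirona Arai, Xiang Ren, Jonathan May

Affiliations: University of Southern California Information Sciences Institute

AI summary: Proposes a structured ontology of untranslatability and a taxonomy of compensation strategies, builds a multilingual dataset, and finds through human preference studies that the Annotation compensation strategy is the most preferred, laying a foundation for strategy-informed machine translation.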


Abstract

Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.

2606.17352 2026-06-17 cs.LG cs.CV New submission

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

Rahim Hossain, Md Tawheedul Islam Bhuian, Md Farhan Shadiq, Kyoung-Don Kang

Affiliations: School of Computing, State University of New York at Binghamton

AI summary: Proposes the MM++ framework, which identifies discriminative intermediate layers via entropy density drops and combines them with a Ledoit-Wolf regularized covariance matrix for unsupervised, post-hoc, scale-invariant multilayer OOD detection, with robust performance in near- and far-OOD settings.


Abstract

We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.
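
Illustrative sketch (not from the paper): the distance step described above can be pictured as a Mahalanobis score under one covariance shared by all classes and shrunk toward a scaled identity. The toy below works on 2-D features so the matrix inverse has a closed form. The fixed weight `ALPHA` is a hypothetical stand-in for the data-driven Ledoit-Wolf coefficient, and layer selection and fusion are not modeled.

```python
# Toy 2-D Mahalanobis OOD score with a tied (shared) covariance.
# ALPHA is a fixed, hypothetical shrinkage weight; Ledoit-Wolf would estimate it from data.
ALPHA = 0.1

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def tied_covariance(classes):
    """Pooled within-class covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    total = 0
    for points in classes.values():
        mx, my = mean(points)
        for x, y in points:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
        total += len(points)
    return (sxx / total, sxy / total, syy / total)

def shrink(cov, alpha=ALPHA):
    """Blend the covariance with a scaled identity: (1 - a) * S + a * mu * I."""
    sxx, sxy, syy = cov
    mu = (sxx + syy) / 2.0
    return ((1 - alpha) * sxx + alpha * mu,
            (1 - alpha) * sxy,
            (1 - alpha) * syy + alpha * mu)

def mahalanobis_sq(x, center, cov):
    """Quadratic form d^T S^{-1} d using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - center[0], x[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def ood_score(x, classes):
    """Distance to the nearest class mean; larger means more out-of-distribution."""
    cov = shrink(tied_covariance(classes))
    return min(mahalanobis_sq(x, mean(pts), cov) for pts in classes.values())

classes = {
    "a": [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (1.1, 1.0)],
    "b": [(5.0, 5.0), (6.0, 5.1), (5.2, 6.0), (6.1, 6.2)],
}
near = ood_score((0.5, 0.5), classes)
far = ood_score((20.0, -10.0), classes)
```

A point inside a class cluster gets a small score and a distant point a large one; a threshold on the score would separate the two.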

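2606.17350 2026-06-17 cs.CL cs.AI New submission

Do Large Language Models Always Tell The Same Stories?

Thennal DK, Hans Ole Hatzel

Affiliations: University of Hamburg

AI summary: Using a contrastive framework and a dataset of human-written stories, studies the narrative similarity of stories generated by 10 LLMs; finds that LLM stories are more similar to each other than human stories are, that frontier models converge on a "mean" generic narrative, and that common mitigation strategies are ineffective.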


Abstract

Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

2606.17345 2026-06-17 cs.LG cs.AI New submission

Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics

Ryota Takamido, Hiroki Nakamoto

Affiliations: Sports Innovation Organization, National Institute of Fitness and Sports in Kanoya

AI summary: Uses a Transformer model and counterfactual analysis to optimize the final and setup pitches in MLB pitch sequences; finds that this could substantially improve season-level performance (e.g., K/9 up by more than 1.0) and offers practical insights such as effective locations per velocity band.


Abstract

Although pitch sequencing is a central topic in baseball analytics, previous studies have primarily focused on optimizing the final pitch within a single plate appearance, leaving the role of preceding setup pitches and their impact on long-term season-level performance insufficiently examined. To address these issues, this study conducted counterfactual analyses using MLB Statcast data. A Transformer-based machine-learning model was trained to predict whether a target pitch would result in an in-play outcome or swing-out. Counterfactual pitch sequences were then generated by replacing either the final pitch or the preceding setup pitch with alternative pitch types and locations while keeping the surrounding contextual information fixed. Optimal counterfactual selections were defined as those that minimized the predicted in-play probability, and their expected effects on pitchers' seasonal statistics were estimated using regression models linking model outputs to season statistics. The results suggest that the optimization of both final and setup pitches may substantially influence season-level performance, including improvements of more than 1.0 in K/9. The analyses also provided several practical insights, including velocity-band-specific effective locations, the importance of pitch commands, and the expansion of pitch-selection options through middle-velocity pitches. These findings quantitatively support the strategic importance of pitch sequencing in baseball.
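
Illustrative sketch (not from the paper): the counterfactual selection step amounts to swapping one pitch while holding the context fixed and keeping the alternative with the lowest predicted in-play probability. In the toy below, `predicted_in_play_prob` is a made-up stand-in for the trained Transformer, and the pitch types, zones, and numbers are all hypothetical.

```python
from itertools import product

PITCH_TYPES = ["fastball", "slider", "changeup"]  # hypothetical candidate set
ZONES = ["high", "middle", "low"]                 # hypothetical location grid

def predicted_in_play_prob(context, pitch_type, zone):
    """Stand-in for the trained model; every number here is invented."""
    base = {"fastball": 0.35, "slider": 0.28, "changeup": 0.30}[pitch_type]
    zone_shift = {"high": -0.02, "middle": 0.08, "low": -0.05}[zone]
    # Assumption: repeating the previous pitch type is easier to hit.
    repeat_penalty = 0.04 if context["previous_pitch"] == pitch_type else 0.0
    return base + zone_shift + repeat_penalty

def best_counterfactual(context):
    """Replace only the target pitch, keep the context fixed, minimize in-play probability."""
    return min(
        product(PITCH_TYPES, ZONES),
        key=lambda choice: predicted_in_play_prob(context, *choice),
    )

context = {"count": "1-2", "previous_pitch": "slider"}
choice = best_counterfactual(context)
```

With these invented numbers the search picks a low changeup; the season-level step in the abstract would then map such per-pitch gains to statistics like K/9 through a separate regression.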

2606.17343 2026-06-17 cs.CV stat.AP New submission

Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling

Ahmed Karam Eldaly, Matteo Figini, Daniel C. Alexander

Affiliations: Department of Computer Science, University of Exeter; UCL Hawkes Institute, Department of Computer Science, University College London

AI summary: Proposes an uncertainty quantification framework based on compressed sensing MR image reconstruction, formulated as a Bayesian linear inverse problem with sparsity priors (total variation or wavelet transform) and MCMC sampling via a split-and-augmented Gibbs sampler; validation on single-coil and multi-coil datasets shows image reconstruction and uncertainty quantification performance superior to optimisation-based and deep learning-based methods.


Abstract

We propose a novel framework for uncertainty quantification using compressed sensing magnetic resonance image reconstruction. The problem is formulated within a Bayesian framework as a linear inverse problem, with prior distributions assigned to the unknown model parameters. Specifically, the image to be reconstructed is assumed to be sparse in a given basis. We develop a general framework applicable to any basis and, as examples, test the sparsity of the image in (1) its spatial gradients, using a total variation prior model, and (2) its wavelet transform. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then employed to sample from the posterior distribution of the unknown parameters. The non-differentiable conditional distributions are efficiently sampled using a proximal MCMC method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sub-sampling patterns and ratios. The results demonstrate the superior performance of each proposed approach in reconstructing images compared to its counterpart optimisation-based method. Moreover, compared with existing deep learning-based methods, our framework effectively quantifies uncertainty, showing a notable correlation between estimated uncertainty maps and error maps computed using ground truth and reconstructed images.
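
Illustrative formulation (not from the paper): one common way to write a sparsity-regularized Bayesian linear inverse problem of this kind is given below. The symbols are assumptions: $y$ is the sub-sampled k-space data, $A$ the forward operator, $\Psi$ the sparsifying transform (spatial gradients for total variation, or a wavelet transform), and $\sigma^2$ and $\lambda$ are treated as fixed here even though the abstract assigns priors to the unknown parameters.

```latex
\begin{align}
  y &= A x + n, \qquad n \sim \mathcal{N}(0, \sigma^2 I), \\
  p(x \mid y) &\propto \exp\!\left( -\frac{1}{2\sigma^2} \lVert y - A x \rVert_2^2
                 - \lambda \lVert \Psi x \rVert_1 \right).
\end{align}
```

Under this reading, the $\ell_1$ term is the non-differentiable part that motivates a proximal sampler, and a per-pixel uncertainty map could be taken as the standard deviation of the posterior samples.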

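2606.17342 2026-06-17 cs.CV New submission

Learning a Maximum Entropy Model for Visual Textures using Diffusion

Xinyuan Zhao, Eero P. Simoncelli

Affiliations: New York University; Flatiron Institute

AI summary: Proposes the first texture modeling approach that learns the statistics of a maximum entropy model without supervision, using methods from diffusion models; with only 512 statistics it generates texture images as good as or better than the current state-of-the-art model (~177k statistics) and supports smooth interpolation.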


Abstract

Visual textures -- spatially homogeneous image regions containing repeated elements (e.g. a field of grass, the bark of a tree) -- are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.
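
Illustrative formulation (not from the paper): the textbook maximum entropy construction behind models of this kind is sketched below. The notation is an assumption: $\phi(x)$ is the vector of statistics (512-dimensional for the compact model in the abstract), $\mu$ the target statistics measured from a texture, and $\theta$ the Lagrange multipliers.

```latex
\begin{align}
  \max_{p}\; & -\int p(x) \log p(x)\, dx
  \quad \text{subject to} \quad \mathbb{E}_{p}[\phi(x)] = \mu, \\
  \Longrightarrow\quad & p_\theta(x) = \frac{1}{Z(\theta)} \exp\!\left( -\theta^\top \phi(x) \right).
\end{align}
```

In this form, all modeling choices sit in $\phi$, which is why learning the statistics rather than hand-designing them is the central step.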

2606.17340 2026-06-17 cs.CV cs.AI New submission

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

Affiliations: Department of Computer Science, Johns Hopkins University; Semaphor Surgical; Johnson & Johnson MedTech

AI summary: Proposes a unified framework that combines a synthetic data pipeline with Hierarchy-Aware Geometry-Semantic Adaptation to learn geometry-consistent, domain-robust image representations, improving pose estimation and depth prediction in monocular endoscopy.


Abstract

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.

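2606.17339 2026-06-17 cs.AI cs.CL cs.SD New submission

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

Affiliations: University of Toronto

AI summary: Proposes the SpeechDx benchmark, covering 12 datasets and 27 tasks organized by stage of speech production (conceptualization, formulation, articulation); evaluates 12 audio encoders and finds that large-scale speech models perform best, but no representation generalizes reliably.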


Abstract

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

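2606.17334 2026-06-17 cs.CV New submission

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Md Tawheedul Islam Bhuian, Kyoung-Don Kang

Affiliations: School of Computing, State University of New York at Binghamton

AI summary: Proposes the FATE framework, which preserves the temporal structure of event streams through Pillar Encoding and uses Frequency-Aware Training to generate dense pseudo-labels, enabling object detection at temporal resolutions up to 200 Hz and outperforming existing methods.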


Abstract

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.
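
Illustrative sketch (not from the paper): projecting a pillar's events onto an orthogonal polynomial basis can be pictured with Legendre polynomials on a window rescaled to [-1, 1]. The choice of Legendre polynomials, the degree, and the treatment of events as signed impulses are assumptions made for this toy; the abstract does not name the basis.

```python
def legendre(k, t):
    """Legendre polynomials P0..P2 evaluated at t in [-1, 1]."""
    if k == 0:
        return 1.0
    if k == 1:
        return t
    if k == 2:
        return 0.5 * (3.0 * t * t - 1.0)
    raise ValueError("only degrees 0..2 are implemented in this sketch")

def encode_pillar(events, t_start, t_end, degree=2):
    """Project one pillar's events onto the basis.

    events: list of (timestamp, polarity) with polarity in {-1, +1}.
    Treating each event as an impulse, the least-squares coefficient of P_k
    is (2k + 1) / 2 * sum_i polarity_i * P_k(t_i).
    """
    coeffs = []
    for k in range(degree + 1):
        total = 0.0
        for ts, pol in events:
            # Map the accumulation window [t_start, t_end] onto [-1, 1].
            t = 2.0 * (ts - t_start) / (t_end - t_start) - 1.0
            total += pol * legendre(k, t)
        coeffs.append((2 * k + 1) / 2.0 * total)
    return coeffs

# Two pillars with the same event count but different timing inside a 10 ms window.
early = encode_pillar([(0.001, 1), (0.002, 1)], t_start=0.0, t_end=0.010)
late = encode_pillar([(0.008, 1), (0.009, 1)], t_start=0.0, t_end=0.010)
```

The zeroth coefficient is a plain signed event count, so it is identical for both pillars; the first-order coefficient flips sign, which is the kind of intra-window timing information that fixed sub-binning would need several bins to capture.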

2606.17331 2026-06-17 cs.LG New submission

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

Hibat Errahmen Djecta, Sergey Alyaev, Kristian Fossum, Reidar B. Bratvold, Ressi Bonti Muhammad, Apoorv Srivastava

Affiliations: NORCE Research Centre; University of Stavanger; Stanford University

AI summary: Proposes a geosteering framework that combines particle filtering with reinforcement learning, explicitly modeling geological uncertainty and evaluating three decision policies, for stable and efficient real-time optimization of well trajectories.


Abstract

Geosteering requires navigating a well trajectory through an unknown geological configuration, while sequentially updating decisions based on indirect measurements acquired during drilling. This work presents an uncertainty-aware geosteering framework that tightly integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning for sequential decision-making. Geological uncertainty ahead of the drill bit is represented explicitly through a particle filter (PF), enabling belief-informed control rather than deterministic trajectory correction. The framework couples PF belief updates with belief-informed decision policies and evaluates three decision-making options that operate under identical uncertainty representations: an interpretable Approximate Dynamic Programming (ADP) scheme, a Deep Q-learning baseline, and a Dual Deep Reinforcement Learning (Dual DRL) architecture trained with a target Q-network scheme for stability, using a dueling (value/advantage) decomposition for Q-value parameterization. Beyond final placement performance, we assess policy behavior using stability-oriented metrics that quantify steering smoothness over time, providing additional operational insight into how decision policies respond as uncertainty evolves. The framework is integrated with an API for validation within an industrial geosteering simulator under realistic measurement noise and drilling constraints. Using identical geological realizations, operational limits, and reward definitions across methods, the experiments provide a controlled and high-fidelity evaluation of how alternative decision policies behave throughout the drilling process, rather than evaluating performance solely from the final well trajectory.
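
Illustrative sketch (not from the paper): a particle-filter belief update reweights each hypothesis by how well it explains the new measurement and then renormalizes. The 1-D state (depth of a layer boundary), the direct measurement model, and the Gaussian noise level below are all hypothetical simplifications.

```python
import math

def update_belief(particles, weights, measurement, noise_std, predict):
    """One particle-filter update: reweight by Gaussian likelihood, then normalize."""
    new_weights = []
    for particle, weight in zip(particles, weights):
        residual = measurement - predict(particle)
        likelihood = math.exp(-0.5 * (residual / noise_std) ** 2)
        new_weights.append(weight * likelihood)
    total = sum(new_weights)
    return [w / total for w in new_weights]

def belief_mean(particles, weights):
    """Weighted mean of the particle states."""
    return sum(p * w for p, w in zip(particles, weights))

def effective_sample_size(weights):
    """Drops toward 1 as the belief concentrates on few particles."""
    return 1.0 / sum(w * w for w in weights)

# Hypothetical 1-D state: depth (m) of a layer boundary ahead of the bit.
particles = [1995.0, 2000.0, 2005.0, 2010.0]
weights = [0.25] * 4
# Hypothetical measurement model: the tool reads the boundary depth directly.
weights = update_belief(
    particles, weights, measurement=2004.0, noise_std=3.0,
    predict=lambda depth: depth,
)
estimate = belief_mean(particles, weights)
```

A belief-informed policy would take the full weighted particle set as input, not only the point estimate, which is what distinguishes it from deterministic trajectory correction.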

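2606.17328 2026-06-17 cs.AI New submission

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

Affiliations: Michigan State University; Case Western Reserve University

AI summary: Proposes the MemTrace benchmark, which takes the knowledge point as its unit of measurement and evaluates the long-term memory of LLM agents along three dimensions (memory age, question type, and evidence condition), finding that evidence use is the main bottleneck.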


Abstract

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
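
Illustrative sketch (not from the paper): the difference between pooling over question rows and scoring per knowledge point can be seen on a toy table. The rows, field layout, and helper names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical question rows: (knowledge_point, question_type, correct).
rows = [
    ("kp1", "current", True), ("kp1", "earlier", True), ("kp1", "trajectory", False),
    ("kp2", "current", True), ("kp2", "earlier", True), ("kp2", "trajectory", False),
    ("kp3", "current", True), ("kp3", "earlier", False), ("kp3", "trajectory", True),
]

def pooled_accuracy(rows):
    """Accuracy over all question rows, ignoring which fact each row probes."""
    return sum(correct for _, _, correct in rows) / len(rows)

def accuracy_by_question_type(rows):
    """Accuracy split along one controlled dimension."""
    hits, counts = defaultdict(int), defaultdict(int)
    for _, qtype, correct in rows:
        hits[qtype] += correct
        counts[qtype] += 1
    return {qtype: hits[qtype] / counts[qtype] for qtype in counts}

def fully_tracked_points(rows):
    """Knowledge points answered correctly under every probe."""
    ok = defaultdict(lambda: True)
    for kp, _, correct in rows:
        ok[kp] = ok[kp] and correct
    return sorted(kp for kp, good in ok.items() if good)

by_type = accuracy_by_question_type(rows)
```

Here the pooled accuracy is 6/9, yet trajectory questions succeed only a third of the time and no knowledge point survives all three probes, which is the kind of failure a pooled number hides.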

2606.17321 2026-06-17 cs.LG cs.CV New submission

ProCUA-SFT Technical Report

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

Affiliations: NVIDIA; University of Washington; Allen Institute for AI

AI summary: Proposes the ProCUA-SFT dataset, in which an automated pipeline distills 3.1M step-level SFT samples from synthetic trajectories across 2,484 application combinations; fine-tuning UI-TARS 7B on it reaches a 45.0% success rate on OSWorld, 18.7 percentage points above the baseline.

Comments: 15 pages, 5 figures


Abstract

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continued training of UI-TARS 7B on AgentNet causes the OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35 percentage points above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
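
Illustrative sketch (not from the paper): expanding a trajectory into step-prefix samples can be read as emitting one training example per step, each carrying everything the agent would have seen up to that point. The sample fields and the action strings below are hypothetical; the dataset's actual layout is not given in the abstract.

```python
def step_prefix_samples(task, steps):
    """Turn one trajectory into one SFT sample per step.

    steps: list of (observation, action) pairs in execution order.
    Sample i holds the task, all earlier (observation, action) pairs,
    the current observation as context, and the current action as target.
    """
    samples = []
    for i, (observation, action) in enumerate(steps):
        samples.append({
            "task": task,
            "history": list(steps[:i]),
            "observation": observation,
            "target_action": action,
        })
    return samples

trajectory = [
    ("screenshot_0", "click(File)"),
    ("screenshot_1", "click(Save As)"),
    ("screenshot_2", "type(report.xlsx)"),
]
samples = step_prefix_samples("Save the sheet as report.xlsx", trajectory)
```

If each step yields exactly one sample, the abstract's figures (3.1M samples from 93K trajectories) would correspond to roughly 33 steps per trajectory on average; that reading is an inference, not a number stated in the source.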

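2606.17312 2026-06-17 cs.AI New submission

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

Affiliations: AWS Generative AI Innovation Center

AI summary: Proposes a structural uncertainty framework that assesses LLM reasoning consistency through the stability of self-preference rankings; on logical and mathematical tasks it complements answer dispersion and improves identification of unreliable instances.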

Comments: Published at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. Accepted as best paper.


Abstract

Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
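
Illustrative sketch (not from the paper): the two entropy-based components can be pictured as follows, under one plausible reading of the abstract. Across-trial instability is taken here as the entropy of which candidate ranks first over repeated trials, and within-trial ambiguity as the mean entropy of the normalized candidate scores inside a trial. Both definitions and the example scores are assumptions; in the paper the scores come from Bradley-Terry modeling with PageRank.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def across_trial_instability(top_choices):
    """Entropy of which candidate ranks first across repeated ranking trials."""
    counts = Counter(top_choices)
    n = len(top_choices)
    return entropy([c / n for c in counts.values()])

def within_trial_ambiguity(trial_scores):
    """Mean entropy of the normalized candidate scores inside each trial."""
    entropies = []
    for scores in trial_scores:
        total = sum(scores)
        entropies.append(entropy([s / total for s in scores]))
    return sum(entropies) / len(entropies)

stable = across_trial_instability(["A", "A", "A", "A"])
unstable = across_trial_instability(["A", "B", "C", "A"])
sharp = within_trial_ambiguity([[0.9, 0.05, 0.05]])
flat = within_trial_ambiguity([[1 / 3, 1 / 3, 1 / 3]])
```

A uniform score vector gives the maximum value, log 3 for three candidates, which matches the abstract's description of the signal collapsing toward uniformity on factual retrieval.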

2606.17310 2026-06-17 cs.CV New submission

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

Affiliations: University of Michigan, Ann Arbor; VISTEC, Thailand

AI summary: Proposes SierpinskiCam, which strengthens geometric guidance with Sierpinski dome texture cues and introduces a reference video conditioning mechanism to address sparse regions when the camera departs far from the source trajectory in monocular video retaking, improving camera controllability, geometric consistency, and video quality.

Comments: 20 pages, 13 figures


Abstract

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.

2606.17309 2026-06-17 cs.RO 新提交

Abstention-Aware Personalized Object Rearrangement via Uncertainty-Guided LLM Assistance

基于不确定性引导的LLM辅助的弃权感知个性化物体重排

Sam Collin, Ali Ayub

发表机构 * Concordia University(康考迪亚大学)

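AI summary Proposes APOLLO, a framework that combines a lightweight personalized embedding model with selective large language model assistance, using uncertainty estimates to invoke the LLM only for ambiguous decisions, for efficient, privacy-preserving, abstention-aware object rearrangement.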

Comments Accepted at the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

详情
AI中文摘要

家庭环境中的机器人辅助不仅需要预测物体应放置的位置,还需要推理何时不应放置物体。现有的个性化物体重排方法主要假设观测清晰且完全可操作,限制了其在现实、杂乱且部分错误环境中的适用性。本文提出APOLLO,一个用于弃权感知个性化物体重排的混合框架,结合了轻量级个性化嵌入模型(PEM)与选择性大语言模型(LLM)辅助。PEM针对每个用户-环境对使用少量演示进行训练,完全在CPU上运行,并产生不确定性估计,用于仅对模糊决策选择性调用基于LLM的推理,平衡效率、隐私和推理能力。为了在现有基准之外评估该公式,我们引入了APOR,一个合成的、由LLM生成的数据集,捕捉房间级、多家具环境、多样化的组织配置文件、明确的弃权行为和嘈杂的部分场景上下文。在PARSEC和APOR上的大量实验初步表明,APOLLO在受控基准设置中优于先前基于LLM的基线,同时大幅减少LLM的使用。代码可在该网址获取。

英文摘要

Robotic assistance in household environments requires not only predicting where objects should be placed, but also reasoning about when objects should not be placed at all. Existing approaches to personalized object rearrangement primarily focus on placement decisions under the assumption of clean observations and complete actionability, limiting their applicability in realistic, cluttered, and partially erroneous settings. In this paper, we introduce APOLLO, a hybrid framework for abstention-aware personalized object rearrangement that combines a lightweight, personalized embedding model (PEM) with selective large language model (LLM) assistance. PEM is trained for each user-environment pair using a small number of demonstrations, operates entirely on CPU, and produces uncertainty estimates, which are used to selectively invoke LLM-based reasoning only for ambiguous decisions, balancing efficiency, privacy, and reasoning capability. To evaluate this formulation beyond existing benchmarks, we introduce APOR, a synthetic, LLM-generated dataset that captures room-level, multi-furniture environments, diverse organizational profiles, explicit abstention behavior, and noisy partial scene context. Extensive experiments on both PARSEC and APOR provide initial evidence that APOLLO improves over prior LLM-based baselines in controlled benchmark settings while substantially reducing LLM usage. Code is available at https://github.com/PaInt-Lab/APOLLO.
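The selective-invocation pattern described above can be sketched as an uncertainty gate. This is a minimal illustration under stated assumptions, not the APOLLO implementation: predictive entropy stands in for the uncertainty estimate (the abstract does not say which measure is used), and `decide`, `ask_llm`, the labels, and the threshold are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decide(probs, labels, threshold, ask_llm):
    """Return (placement, source): local model if confident, else the LLM."""
    if entropy(probs) > threshold:
        # Ambiguous case: defer to the (hypothetical) LLM callback.
        return ask_llm(labels), "llm"
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], "local"


labels = ["shelf", "drawer", "abstain"]
fake_llm = lambda options: "abstain"  # stand-in for a real LLM call

print(decide([0.90, 0.05, 0.05], labels, 0.8, fake_llm))  # ('shelf', 'local')
print(decide([1 / 3, 1 / 3, 1 / 3], labels, 0.8, fake_llm))  # ('abstain', 'llm')
```

Treating "abstain" as an ordinary label keeps the gate simple; the threshold then controls how often the LLM is consulted.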

2606.17301 2026-06-17 cs.SD cs.LG 新提交

Turning music identification into a neural forward pass

将音乐识别转化为神经前向传播

Muhammad Taimoor Haseeb, Ahmad Hammoudeh, Gus Xia

发表机构 * Music X Lab(音乐X实验室) Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

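AI summary Proposes performing music identification in a single neural forward pass with a generative transformer, surpassing conventional acoustic fingerprinting on short audio excerpts while substantially reducing storage and latency.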

详情
AI中文摘要

搜索是计算机科学中的基础操作,它将查询映射到集合中的匹配项。通常,它被实现为类似系统2的基于规则的流水线:计算键、探测索引、验证候选。相比之下,人类识别类似于系统1的联想式身份恢复模型,其中即使部分线索也能触发回忆,而无需显式枚举、排序甚至访问离散候选。在这里,我们展示了音乐声音识别——一个困难的搜索问题——可以通过生成式Transformer在单次神经前向传播中完成。该模型在音频数据集上训练,从短音频片段预测对应的曲目标识符。这种方法超越了最先进的声学指纹识别,对于短音频片段(1秒)的提升最大,证明了该方法不仅可行而且具有优势。此外,它将外部存储减少到基线的0.33%,并将推理延迟提高了2.3倍(p95)。而且,该模型可以拒绝未见曲目的查询,支持开放集操作,同时降低误归因风险。以音乐曲目识别为例,这项工作重新定义了搜索,使其更接近人类联想识别,远离算法数据库查找。

英文摘要

Search, a foundational operation in computer science, maps a query to a matching item in a collection. It is typically implemented as a System-2-like, rule-based pipeline in which a key is computed, an index is probed, and candidates are verified. By contrast, human recognition resembles a System-1-like, associative model of identity recovery, in which even partial cues can trigger a recall without explicitly enumerating, ranking, or even accessing discrete candidates. Here, we show that music sound identification, a difficult search problem, can be performed in a single neural feed-forward pass by a generative transformer. Trained on an audio dataset, the model predicts the corresponding track identifier from a short audio excerpt. This approach surpasses state-of-the-art acoustic fingerprinting, with the largest gains for short audio segments (1 second), demonstrating that the method is not only viable but advantageous. Moreover, it reduces external storage to 0.33% of the baseline footprint and improves inference latency by 2.3x (p95). Furthermore, the model can reject queries for unseen tracks, supporting open-set operation while reducing misattribution risk. Using music track identification as an example, this work reframes search, bringing it closer in spirit to human associative recognition and away from algorithmic database lookup.

2606.17299 2026-06-17 cs.CL 新提交

Examining the Limits of Word2Vec with Toki Pona

用 Toki Pona 检验 Word2Vec 的极限

Daniel Zhenhan Huang, Hongchen Wu

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

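AI summary Trains Word2Vec on Toki Pona, a constructed language of only about 130 words, to examine embedding quality at an extremely small vocabulary size and to analyze the effect of non-core noise tokens.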

Comments 10 pages, 4 figures, 3 tables. Accepted to the Society for Computation in Linguistics (SCiL) 2026

详情
AI中文摘要

Word2Vec 在生成语义嵌入方面的有效性已得到广泛验证,但其测试几乎完全集中在词汇量大的语言上。本研究使用 Toki Pona(一种约130个单词的人造语言)的数据,检验 Word2Vec 能否在极小的词汇量下成功捕获语义关系。我们从 Toki Pona 社区获取了140万句子(795万词元)用于训练。语料库中约23%的句子包含非 Toki Pona 词元,如命名实体、借词和新词。为了探究这种语言噪声是增强还是阻碍性能——这是词嵌入文献中很少涉及的话题——我们训练了两个不同的模型:一个保留这些偶然词元,另一个将其完全过滤。评估采用定量方法(测量单词到语义类别质心的距离)、自动轮廓分数(通过凝聚聚类)以及定性分析(使用与英语对比的表征相似性矩阵)。结果表明,虽然稀疏的非核心词元不影响所学嵌入的相对结构,但它们实际上使相似词在向量空间中更接近。重要的是,即使在这个极端下限,Word2Vec 的有效性更多地取决于分布模式而非词汇表大小。

英文摘要

Word2Vec's effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance -- a topic rarely addressed in word embedding literature -- we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec's effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
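The centroid-proximity evaluation mentioned in the abstract can be illustrated with a toy sketch. Everything here is assumed for illustration: cosine similarity as the proximity measure, the two-dimensional vectors, and the category names; the paper's actual embeddings and categories are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest_category(word_vec, categories):
    """Name of the category whose centroid is most similar to word_vec."""
    centroids = {name: centroid(vecs) for name, vecs in categories.items()}
    return max(centroids, key=lambda name: cosine(word_vec, centroids[name]))


# Toy embeddings (made up), grouped by a hypothetical semantic category.
categories = {
    "animal": [[1.0, 0.1], [0.9, 0.2]],
    "color": [[0.1, 1.0], [0.2, 0.9]],
}
print(nearest_category([0.8, 0.3], categories))  # animal
```

A word counts as well placed when its nearest centroid matches its labeled category; averaging that over the vocabulary gives one quantitative score per model.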

2606.17298 2026-06-17 cs.CV 新提交

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

面向手术室视频的推理式文本-视频检索:基于动作驱动数字孪生

Yiqing Shen, Hao Ding, Mathias Unberath

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

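AI summary Proposes OR3, which converts video clips into structured action-driven digital twins (ActDTs), has a large language model generate hypothetical ActDTs for retrieval, and applies evidence-grounded refinement to reason over implicit queries, outperforming the strongest baseline on operating-room video retrieval.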

详情
AI中文摘要

手术室中的文本-视频检索是实现手术室安全的关键技术,它允许利益相关者检索和检查特定事件的记录。然而,由于最安全关键的事件可能不遵循常见结构,为了充分发挥其潜力,文本-视频检索必须能够处理需要推理才能识别正确视频的隐式查询(例如,剪断前的一步)。然而,现有方法依赖于无法对此类查询进行推理的全局嵌入。我们提出OR3,一种文本-视频检索方法,它将视频片段转换为动作驱动数字孪生(ActDTs),将并发的“主体-动作-对象”三元组分组到非重叠的时间间隔内。此外,与通过配对编码器进行跨模态匹配不同,OR3执行基于想象的检索,其中大语言模型从查询生成假设的ActDTs。这通过使用针对ActDT定制的难负样本训练的单一编码器实现模态内匹配。最后,基于证据的修正根据与顶级候选者的差异修正想象的ActDTs,以捕获特定于程序的模式。我们从MM-OR构建了一个基准,包含来自机器人膝关节手术的386个视频片段的276个隐式查询,涵盖四个推理类别。OR3实现了57.6的R@1和77.3的R@5,优于最强基线。这些结果表明,OR3通过时间动作推理实现了视觉上相似的手术室视频片段之间的细粒度区分。

英文摘要

Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, text-to-video retrieval can reach its full potential only if it handles implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.

2606.17296 2026-06-17 cs.CV 新提交

Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

Pareto LoRA:通过帕累托最优梯度集成缓解统一多模态模型中的模态不平衡

Xiwen Wei, Mark Nutter, Madhusudhanan Srinivasan, Radu Marculescu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Advanced Micro Devices, Inc.(超威半导体公司)

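AI summary Addresses the drop in image generation quality that occurs when language gradients dominate optimization during LoRA fine-tuning of unified multimodal models. Proposes Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating gradient direction and strength, improving perceptual image quality by up to 44.9% on the CoMM benchmark.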

详情
AI中文摘要

统一多模态模型(UMMs)最近作为一种有前景的范式出现,将多模态理解和生成集成在单个自回归Transformer中。然而,在多模态指令微调期间,这些模型通常表现出明显的模态不平衡:语言梯度主导优化,从而导致图像生成质量较低,尤其是在参数高效微调(如LoRA)下。在这项工作中,我们系统分析了基于LoRA的UMMs在交错文本-图像生成微调中的模态不平衡。我们表明,与单模态对应物相比,视觉模态性能下降幅度远大于文本模态性能,并且模态特定梯度在不同任务和层之间可能相差数个数量级。受此观察启发,我们将多模态指令微调重新表述为双目标优化问题,并提出Pareto LoRA,一种帕累托最优梯度集成策略,通过调节梯度方向和强度来平衡文本和图像目标。在CoMM基准上使用Emu2的实验表明,Pareto LoRA持续改善多模态生成平衡,在保持可比文本性能的同时,图像感知质量相比普通LoRA提升高达44.9%。

英文摘要

Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
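To make the bi-objective framing concrete, the sketch below uses the classic two-task min-norm (MGDA-style) combination, swapped in here as a well-known example of Pareto-oriented gradient integration. It is not claimed to be the Pareto LoRA update rule, which the abstract does not spell out; the gradient values and function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_combination(g_text, g_image):
    """Minimum-norm point on the segment between two gradients.

    Returns (alpha, d) with d = alpha * g_text + (1 - alpha) * g_image,
    where alpha = clip((g_image - g_text) . g_image / ||g_text - g_image||^2, 0, 1).
    """
    diff = [b - a for a, b in zip(g_text, g_image)]  # g_image - g_text
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: any weighting gives the same d
        return 0.5, list(g_text)
    alpha = max(0.0, min(1.0, dot(diff, g_image) / denom))
    combined = [alpha * a + (1.0 - alpha) * b for a, b in zip(g_text, g_image)]
    return alpha, combined


# A text gradient ten times larger than the image gradient (made-up values).
alpha, d = min_norm_combination([10.0, 0.0], [0.0, 1.0])
print(round(alpha, 4), [round(x, 4) for x in d])
```

In this toy case the dominant text gradient receives a weight of about 0.01, and the combined direction has a positive inner product with both gradients, so a small step along it does not favor one objective at the other's expense.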

2606.17294 2026-06-17 cs.RO cs.LG 新提交

VISTA: Scale-Aware Visual Navigation via Action History Conditioning

VISTA:通过动作历史条件实现尺度感知的视觉导航

Maeva Guerrier, Koki Kobayashi, Simon Roy, Jana Pavlasek, Giovanni Beltrame

发表机构 * Polytechnique Montreal(蒙特利尔理工学院) MILA(MILA研究所) Institute of Science Tokyo(东京科学大学) CoRA Lab(CoRA实验室) Mist Lab(Mist实验室)

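AI summary Addresses the scale fragility that action normalization introduces in visual navigation foundation models by conditioning on action history to provide physical-displacement context, and integrates a DINOv3 encoder to strengthen feature representations in repetitive environments, enabling zero-shot deployment across environments.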

详情
AI中文摘要

视觉导航基础模型(VNMs)承诺能够实现端到端的学习导航策略,并能在不同实体和环境之间进行零样本部署。为了保持通用性,许多基于视觉的导航模型预测归一化动作。然而,这种归一化引入了一个关键的部署漏洞:对相同的归一化轨迹应用不同的缩放因子会改变其物理几何形状,从而降低导航性能并增加碰撞风险。我们通过将模型条件化于归一化动作历史以及图像观测来解决这一漏洞,为模型预测与机器人实际物理位移之间的关系提供显式上下文。此外,当前的VNMs在缺乏显著特征的视觉重复环境中常常表现不佳。为解决此问题,我们集成了DINOv3编码器,其更丰富的表示使我们的模型能够捕获观测之间的空间和几何维度。VISTA能够鲁棒地泛化到分布外环境,在户外、森林和办公室环境的零样本真实世界部署中实现了100%的目标预测准确率,平均95%的检查点被穿越,展示了在未见环境中的一致路径跟随能力。

英文摘要

Vision Navigation Foundation Models (VNMs) promise end-to-end learned navigation policies capable of zero-shot deployment across diverse embodiments and environments. To maintain generality, many vision-based navigation models predict normalized actions. However, this normalization introduces a critical deployment vulnerability: applying different scaling factors to the same normalized trajectory alters its physical geometry, which degrades navigation performance and increases collision risks. We address this vulnerability by conditioning the model on normalized action histories alongside image observations, providing explicit context on the relationship between the model's predictions and the robot's actual physical displacement. Furthermore, current VNMs often struggle in visually repetitive environments that lack distinct features. To resolve this issue, we integrate a DINOv3 encoder, whose richer representations enable our model to capture both spatial and geometric dimensions between observations. VISTA generalizes robustly to out-of-distribution environments, achieving 100% goal prediction accuracy in zero-shot, real-world deployment in Outdoor, Forest and Office settings, and an average of 95% checkpoints crossed, demonstrating consistent path following in unseen environments.
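The scale ambiguity described above is easy to see numerically. The sketch below is only an illustration with made-up waypoints and scale factors: the same normalized trajectory is mapped to metres with two different scales, and the resulting physical path lengths differ by the same ratio.

```python
import math


def to_physical(normalized_waypoints, scale_m):
    """Map normalized (x, y) waypoints to metres with a single scale factor."""
    return [(x * scale_m, y * scale_m) for x, y in normalized_waypoints]


def path_length(waypoints):
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))


normalized = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5)]  # hypothetical model output

small_robot = to_physical(normalized, scale_m=0.2)
large_robot = to_physical(normalized, scale_m=2.0)
print(path_length(small_robot), path_length(large_robot))
```

A model that sees only images cannot tell which of the two executions it is driving; a history of past normalized actions paired with the observed image change is one way to expose that relationship, which is the conditioning signal the abstract describes.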

2606.17289 2026-06-17 cs.AI cs.CL 新提交

Nothing from Something: Can a Language Model Discover 0?

无中生有:语言模型能否发现0?

Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

发表机构 * Department of Computer Science, Princeton University(普林斯顿大学计算机科学系)

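AI summary Studies whether language models can independently discover the concept of zero, using arithmetic tasks. GPT-2-scale models fail to generalize at test time but improve substantially after training on a small number of examples, and language pretraining cuts the number of examples required by about 50%.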

详情
AI中文摘要

基于人工神经网络的AI系统正被开发,旨在推动人类数学知识的边界。这些系统的关键问题在于它们能在多大程度上超越训练数据。数学发现需要一种强形式的分布外泛化能力——假设真正新的、且可能逻辑上更强大的数学结构的能力。已有假设认为,语言能力在人类认知中支持这种泛化。在这项工作中,我们使用简单算术作为案例研究,考察现代AI模型如何扩展其数学视野,评估这些模型能否独立发现“零”的概念。我们表明:(1) GPT-2规模的语言模型在测试时无法进行这种泛化,无论是否经过语言预训练;(2) 但在经过数十或数百个零的示例训练后,模型能显著改进。此外,我们发现语言预训练将所需示例数量减少了约50%,表明语言能力可以支撑神经模型中的数学发现。

英文摘要

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

2606.17281 2026-06-17 cs.CL cs.SD eess.AS 新提交

Are you speaking my languages? On spoken language adherence in multimodal LLMs

你在说我的语言吗?多模态大语言模型中的口语遵循问题

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

发表机构 * Google DeepMind(谷歌DeepMind)

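AI summary Addresses the tendency of multimodal large language models to output the wrong language in automatic speech recognition. Proposes a soft prompting approach, introduces a new metric that quantifies language-adherence violations, and compares three mitigation strategies (zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning) on how well they reduce violations while preserving ASR performance.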

Comments 7 pages, 3 tables in the main body

详情
AI中文摘要

虽然基于大语言模型(LLM)的自动语音识别(ASR)能够实现无缝的多语言使用,但模型经常错误识别输出语言,损害转录保真度和下游应用质量。为了保持灵活性和代码切换能力,我们提出了一种软提示方法,该方法暗示潜在的口语语言而不严格约束输出。我们正式将这一挑战定义为缺乏语言遵循,引入了一个新的指标来量化违规行为,并评估了三种缓解策略:(1)零样本提示,在不确定性下提供稳健指导;(2)监督微调(SFT),以提高提示遵循度;(3)思维链(CoT)推理,在解码过程中强制遵循。我们跨多种语言对这些方法进行了比较分析,评估了它们在减少语言违规同时保持整体ASR性能方面的有效性。最后,我们讨论了权衡,以指导在不同计算约束下的策略选择。

英文摘要

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating their effectiveness in reducing language violations while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

2606.17279 2026-06-17 cs.CV 新提交

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

基于数字孪生表示的强化学习训练LLMs用于推理密集型手术视频问答

Yiqing Shen, Han Zhang, Mathias Unberath

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

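AI summary Proposes a reinforcement learning framework that builds digital twin representations from surgical foundation models to decouple visual perception from reasoning, and introduces hierarchical representations and a new reward, achieving state-of-the-art performance on three benchmarks.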

详情
AI中文摘要

手术视频问答需要跨语义、空间和时间维度的多步推理。现有方法在架构上将视频压缩为离散令牌表示,并将视觉感知与推理耦合。这种方法割裂了连续的空间-时间关系,已被证明限制了多步推理能力。我们引入了一个强化学习框架,通过操作由手术基础模型构建的数字孪生表示,训练大型语言模型解耦感知与推理。此外,我们引入了跨帧、时间窗口和程序级别的分层表示,并带有概率不确定性估计。最后,我们提出了一种新颖的奖励,结合了格式验证与通过临床合理性评估和不确定性感知校准进行的准确性评估。为了展示该方法的能力,我们引入了REAL-Colon-Reason,一个包含2000个问题-答案对、涵盖三个复杂度级别的结肠镜基准。我们在REAL-Colon-Reason以及两个现有手术视频问答基准REAL-Colon-VQA和EndoVis18-VQA上取得了最先进的性能。

英文摘要

Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel training reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and on two existing surgical VideoQA benchmarks, REAL-Colon-VQA and EndoVis18-VQA.

2606.17269 2026-06-17 cs.AI cs.SY eess.SY 新提交

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

技能约束下的弹性制造供应链模型预测控制

Carlos Eduardo Sanoja

发表机构 * Quanta Labs, LLC Universidad Monteávila(蒙特阿维拉大学)

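AI summary Proposes a closed-loop model predictive controller for skill-constrained production-inventory systems that optimizes production, inventory, backlog, and training decisions through mixed-integer programming, evaluates it under a range of disruptions, and finds that predictive control helps when skill bottlenecks can be forecast early enough.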

详情
AI中文摘要

在技能约束的生产库存系统中,明天可用的合格人力容量取决于今天的培训决策:生产需要认证工人,认证除非维护否则会失效,而培训消耗与当前生产需求相同的稀缺工时。我们研究了一种闭环技能约束模型预测控制器,该控制器在每个班次求解一个有限时域混合整数规划,涉及生产、库存、缺货和培训,包含二元预测认证、硬生产资格以及一个可解释的终端值,该终端值在时域边界对认证容量缺口进行定价;仅执行第一周期动作后重新规划。在合成、种子控制的SkillChain-Gym场景中——包括公告和新技能冲击、需求冲击、缺勤、预测与可用性质量模式、容量边界与培训率扫描以及阴性对照——我们将该控制器与仅生产和仅维护的消融、静态交叉培训保险计划以及一个强反应式启发式方法进行比较,采用事前固定配置和配对统计。结果是存在制度依赖性,而非优越性:没有策略类别占主导。当技能或劳动力瓶颈可提前足够预测以完成培训时,预测控制有帮助;在意外冲击、接近需求-容量边界以及冲击前松弛使保险廉价的情况下,精益静态保险仍难以被击败。归因消融区分了认证维护、失效认证的重新获取以及全新技能获取。可预测性(而非适应性本身)决定了预测控制何时有价值。

英文摘要

In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.

2606.17266 2026-06-17 cs.AI cs.SY eess.SY 新提交

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

SkillChain-Gym:面向中断下再技能感知的生产-库存控制的基准测试

Carlos Eduardo Sanoja

发表机构 * Quanta Labs, LLC(Quanta Labs有限责任公司) FCEA, Universidad Monteávila(蒙特阿维拉大学经济与行政科学学院)

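AI summary Proposes SkillChain-Gym, a benchmark for evaluating production-inventory control policies that account for skill dynamics such as forgetting and retraining. Experiments find that no policy dominates across all scenarios, so the choice should depend on what the forecast reveals.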

详情
AI中文摘要

生产规划日益需要将劳动力能力视为决策变量:当技能未得到维护时认证会失效,新产品需要当前劳动力不具备的技能,再技能培训与生产争夺相同的工时。现有的运营基准通常将劳动力视为外生变量,而包含技能和学习的劳动力规划模型很少作为可复用的测试平台发布。我们引入了SkillChain-Gym,这是一个针对再技能感知的生产-库存控制的基准规范:一个单站点环境,具有风格化的工人技能状态动态、硬阈值认证、遗忘以及消耗产能的培训动作,这些动作受与生产相同的每个工人时间预算约束。该基准包括种子控制的中断场景、三种可行性模式(带投影诊断)、确定性回放以及涵盖运营、韧性、能力增长和培训访问分布的指标。我们评估了仅生产策略、反应式自适应策略、注水自适应策略和静态保险策略(带预算变体),在60个班次的时间范围内进行配对统计检验。结果是依赖于情景的,而非排序。具备培训能力的策略优于仅生产基线,并且在遗忘存在的情况下,即使没有中断,维护性培训也是必要的。在具备培训能力的策略中,当瓶颈在预测中可见时,自适应培训有帮助,而一个精简的静态交叉培训计划(一个故意有利的比较对象,其结构编码了相关的技能应急情况)在突发冲击和缺勤下充当了强有力的保险。产能松弛和遗忘率决定了这些情景之间的边界。没有策略类在所有情景中占优,这促使了能够决定何时购买技能保险和何时反应的预测驱动型控制器。

英文摘要

Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker hours needed for production. Existing operations benchmarks usually treat labor as exogenous, while workforce-planning models with skills and learning are rarely released as reusable testbeds. We introduce SkillChain-Gym, a benchmark specification for reskilling-aware production-inventory control: a single-site environment with stylized worker skill-state dynamics, hard threshold certification, forgetting, and capacity-consuming training actions constrained by the same per-worker time budget as production. The benchmark includes seed-controlled disruption scenarios, three feasibility modes with projection diagnostics, deterministic replay, and metrics covering operations, resilience, capability growth, and training-access distribution. We evaluate production-only, reactive adaptive, water-filling adaptive, and static-insurance policies with budget variants over 60-shift horizons with paired statistical tests. The results are regime-dependent rather than a ranking. Training-capable policies dominate the production-only baseline, and maintenance training is necessary under forgetting even without disruptions. Among training-capable classes, adaptive training helps when bottlenecks are visible in the forecast, while a lean static cross-training plan, a deliberately favorable comparator whose structure encodes relevant skill contingencies, acts as strong insurance under surprise shocks and absenteeism. Capacity slack and the forgetting rate govern the boundary between these regimes. No policy class dominates across regimes, motivating forecast-driven controllers that decide when to buy skill insurance and when to react.

2606.17257 2026-06-17 cs.CV cs.AI 新提交

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Pulling The REINS: 通过表示引导实现视频扩散模型的无训练安全对齐

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

发表机构 * University of California, Riverside(加州大学河滨分校) YouTube (Google)(YouTube(谷歌))

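AI summary Proposes REINS, which steers the internal representations of video diffusion models along a linear direction at inference time, achieving training-free safety alignment that redirects generation away from harmful content with no weight updates and negligible computational overhead.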

详情
AI中文摘要

开源视频扩散模型能够生成从暴力到虚假信息等逼真的不安全内容,然而现有防御要么需要昂贵的安全微调(这会降低通用能力),要么应用容易被对抗性提示绕过的外部过滤器。我们提出REINS(表示空间推理时安全引导),一种无训练方法,通过在推理时引导其内部表示向安全生成方向对齐视频扩散模型。我们的关键发现是,安全相关结构线性编码在视频扩散Transformer的隐藏状态激活中,并且通过基于二元安全标签的监督PCA发现的一个单一方向足以分离安全与不安全的生成轨迹。在推理时,将该方向添加到中间Transformer层的隐藏状态中,将生成从有害内容重定向到语义相关的安全替代方案,无需权重更新、无需概念枚举,且计算开销可忽略。通过机制分析,我们揭示了虽然安全信息随Transformer深度单调累积,但引导效果在中间层(约50%深度)达到峰值,暴露了信息可用性与下游传播能力之间的基本权衡。我们在9个视频扩散模型、多个参数规模(1.3B-5B)以及文本到视频和图像到视频生成上评估REINS,据我们所知,这是视频生成文献中最广泛的安全评估套件。

英文摘要

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.
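The steering step itself (add a fixed direction to a hidden state at one layer) is simple enough to sketch. Two things below are assumptions made for illustration: the direction is taken as the normalized difference of class means, a simpler stand-in for the Supervised PCA used in the paper, and the activations, `alpha`, and function names are made up.

```python
import math


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def safety_direction(safe, unsafe):
    """Unit vector from the unsafe mean to the safe mean.

    Stand-in for Supervised PCA; not the paper's estimator.
    """
    diff = [s - u for s, u in zip(mean(safe), mean(unsafe))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D activations labeled safe / unsafe (made up).
safe = [[1.0, 0.0], [1.0, 0.2]]
unsafe = [[-1.0, 0.0], [-1.0, -0.2]]
d = safety_direction(safe, unsafe)

hidden = [-1.0, 0.0]
steered = steer(hidden, d, alpha=2.0)
print(dot(hidden, d), dot(steered, d))
```

Because `d` has unit norm, the projection onto the safety direction rises by exactly `alpha`, while components orthogonal to `d` are left untouched.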

2606.17256 2026-06-17 cs.RO cs.CV 新提交

Contrastive Action-Image Pre-training for Visuomotor Control

对比动作-图像预训练用于视觉运动控制

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

发表机构 * UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达) Sapienza University of Rome(罗马大学) Panasonic(松下) ItalAI

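AI summary Proposes CAIP, which uses 3D hand keypoints from large-scale egocentric video as a proxy action signal and learns a unified action-image representation through contrastive learning, substantially improving dexterous manipulation performance with little robot data.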

详情
AI中文摘要

现有的机器人视觉编码器面临一个根本瓶颈:机器人数据集缺乏大规模预训练所需的规模。先前的工作通过转向互联网规模的图像和语言数据或自我中心的人类视频来规避数据稀缺问题。虽然这些模型显示出潜力,但两种范式都没有从配对的视觉和动作数据中学习,而下游视觉运动控制策略需要这些数据。然而,机器人轨迹作为这种配对信号最直接的来源,在预训练规模上不可用,这促使我们从丰富的人类视频中提取动作信号。为此,我们引入了CAIP(对比动作-图像预训练),一种视觉编码器,将大规模自我中心视频中的人类手部姿态视为末端执行器动作的代理。通过提取3D手部关键点(一种与下游机器人动作空间自然对齐的表示),CAIP通过对比目标学习统一的动作-图像表示。利用32,041小时的自我中心人类视频和仅88小时的机器人操作数据,CAIP优于最先进的视觉编码器,包括DINOv2、SigLIP、MVP和R3M。在使用Dexmate Vega和Sharpa Wave手的具有挑战性的真实世界灵巧操作设置上评估,CAIP在涉及折叠、倾倒和精细操作的任务上取得了超过30%的性能提升。我们的结果表明,我们的对比动作中心预训练方法为获得更适合物理交互的鲁棒视觉表示提供了一条可扩展的路径。

英文摘要

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
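A contrastive objective over paired action and image embeddings is commonly written as an InfoNCE loss, sketched below. This is a generic one-directional (action-to-image) version with toy embeddings and an assumed temperature; the abstract does not specify CAIP's exact loss, so treat it as an illustration of the technique rather than the authors' objective.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(actions, images, temperature=0.1):
    """Mean action-to-image InfoNCE loss; pair i is (actions[i], images[i])."""
    a = [normalize(v) for v in actions]
    b = [normalize(v) for v in images]
    # Cosine-similarity logits between every action and every image.
    logits = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]
    loss = 0.0
    for i, row in enumerate(logits):
        log_denominator = math.log(sum(math.exp(z) for z in row))
        loss += log_denominator - row[i]  # -log softmax of the matching pair
    return loss / len(logits)


actions = [[1.0, 0.0], [0.0, 1.0]]     # e.g. flattened hand-keypoint features
aligned = [[1.0, 0.0], [0.0, 1.0]]     # image embeddings matching each action
mismatched = [[0.0, 1.0], [1.0, 0.0]]  # same embeddings, pairs swapped

print(info_nce(actions, aligned) < info_nce(actions, mismatched))  # True
```

The loss is near zero when each action embedding is closest to its own frame and large when pairs are swapped, which is the pressure that pulls the two modalities into a shared space.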

2606.17255 2026-06-17 cs.CL cs.AI 新提交

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

MLLP-VRAIN UPV 系统在 IWSLT 2026 同声传译任务中的应用

Jorge Iranzo-Sánchez, Gerard Mas-Mollà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

发表机构 * MLLP-VRAIN research group(MLLP-VRAIN研究组) VRAIN Universitat Politècnica de València(瓦伦西亚理工大学)

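AI summary Proposes a cascaded simultaneous speech translation system built on the Parakeet and Qwen 3.5 models that tunes the quality-latency trade-off with adaptive black-box policies and handles the context track with ASR word-boosting and a RAG mechanism, gaining +5.82 XCOMET-XL on the MCIF En→De test set.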

Comments IWSLT 2026 System Description

详情
AI中文摘要

本文描述了MLLP-VRAIN研究组参与IWSLT 2026同声传译赛道共享任务的情况。我们的提交利用最近发布的Parakeet和Qwen 3.5模型,通过自适应“黑盒”策略构建了一个鲁棒的级联解决方案,用于长形式SimulST。我们探索了这些策略的松弛版本以实现更好的质量-延迟权衡。与去年相比,我们参与了所有语言方向。此外,对于En→{De, It, Zh}方向,我们还参与了今年新增的上下文跟踪赛道,采用ASR词增强和离线预翻译示例的RAG机制相结合,以引导生成并丰富系统的领域特定上下文。最后,我们提供了系统的详细延迟分析。与去年相比,在MCIF En→De测试集上的结果显示质量显著提升,XCOMET-XL提高了+5.82。我们的上下文跟踪处理进一步提升了+1.03的性能。

英文摘要

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Unlike last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.

2606.17250 2026-06-17 cs.LG cs.CL 新提交

Rethinking Groups in Critic-Free RLVR

重新思考无评论强化学习中的分组

Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie

发表机构 * Université de Montréal(蒙特利尔大学) McGill University(麦吉尔大学) Mila - Quebec AI Institute(Mila - 魁北克人工智能研究所) University of Waterloo(滑铁卢大学) The Chinese University of Hong Kong(香港中文大学) Huawei Noah’s Ark Lab(华为诺亚方舟实验室)

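AI summary Addresses the data inefficiency and synchronization problems of group-based strategies in critic-free reinforcement learning by proposing negative token filtering, which enables stable single-rollout training and achieves comparable or stronger performance on reasoning and agentic tasks.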

详情
AI中文摘要

强化学习已成为大型语言模型后训练的核心范式。现有的无评论强化学习方法通常为同一问题生成一组 rollout 以估计价值基线用于优势计算。然而,这种设计存在数据低效、组同步障碍以及与结构化 rollout 不灵活的问题。在这项工作中,我们重新审视了“分组”的作用,并表明其底层功能不仅仅是估计基线,而是防止对负样本的错误惩罚。基于这一见解,我们提出了负令牌过滤,一种简单有效的策略,能够实现稳定的单 rollout 训练。我们将其应用于两种批量级优势方法,在推理任务上取得了与基于分组的强化学习技术相当的性能,在代理任务上取得了更强的性能。

English abstract

Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
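
As a rough illustration of the group-based baseline described in this abstract, the Python sketch below contrasts a per-question group baseline with a single batch-wide baseline. It is not the authors' code, and negative token filtering itself is not shown because the abstract gives no details of it. The function names, the `(question_id, reward)` data layout, and the use of a plain mean as the baseline are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def group_baseline_advantages(rollouts):
    """Advantage = reward minus the mean reward of rollouts for the same question.

    `rollouts` is a list of (question_id, reward) pairs (hypothetical layout).
    Needs several rollouts per question for the baseline to be informative.
    """
    rewards_by_question = defaultdict(list)
    for question_id, reward in rollouts:
        rewards_by_question[question_id].append(reward)
    baselines = {q: mean(rs) for q, rs in rewards_by_question.items()}
    return [reward - baselines[question_id] for question_id, reward in rollouts]


def batch_baseline_advantages(rollouts):
    """Advantage = reward minus the mean reward over the whole batch.

    Assumed reading of a "batch-level" baseline: it works even with a
    single rollout per question, since no per-question group is needed.
    """
    baseline = mean(reward for _, reward in rollouts)
    return [reward - baseline for _, reward in rollouts]


if __name__ == "__main__":
    batch = [("q1", 1.0), ("q1", 0.0), ("q2", 1.0), ("q2", 1.0)]
    print(group_baseline_advantages(batch))
    print(batch_baseline_advantages(batch))
```

In this toy batch, both rollouts for `q2` succeed, so the group baseline gives them zero advantage, while the batch baseline still assigns them a positive one. The group version also has to wait for every rollout of a question before computing anything, which is one way to picture the synchronization barrier the abstract mentions.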

2606.17246 2026-06-17 cs.CV cs.MA New submission

GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

GeoDisaster: A Benchmark of Orchestrated Agents for Operational Disaster Geo-Intelligence (translated from the Chinese title)

Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain, Muhammad Haris Khan, Subhasis Chaudhuri, Biplab Banerjee

Affiliations * Indian Institute of Technology Bombay; Mohamed bin Zayed University of Artificial Intelligence

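AI summary: Proposes the GeoDisaster benchmark, with 2,921 instances and 43 question types, to evaluate remote-sensing vision-language models on tool-grounded spatial reasoning and structured decision-making, and designs a multi-agent framework aligned via RCEA to improve tool use and evidence grounding.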

Comments 28 pages, 11 Figures

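Details
AI Chinese abstract (English translation)

Remote-sensing vision-language models (RS-VLMs) have moved Earth-observation analysis toward visual interpretation and instruction following, but they still fall short of operational geo-intelligence, which requires tool-grounded spatial reasoning and structured, evidence-backed decisions. We present GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances covering 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe route planning, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers) and cover hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are based on executable geospatial workflows and deterministic consistency checks, with no need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, in which role-specialized agents coordinate through explicit execution contracts and are aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.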

English abstract

Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.