arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09450 2026-06-09 cs.AI New submission

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets

Affiliations: Skolkovo Institute of Science and Technology; HSE University; Artificial Intelligence Research Institute; Sberbank

AI summary: Proposes the TheoremBench benchmark, which uses structured theorem families and fine-grained evaluation metrics to reveal behavioral biases of current provers on complex theorems.

Comments Preprint version (20 pages, 10 figures)

Abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.
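
Editor's sketch: the abstract names theorem-level coverage and token-efficiency metrics but does not define them. The Python below is one plausible reading, not the authors' definition: coverage is taken as the fraction of tasks proved within a theorem family (main theorem plus subtheorems), and efficiency as the mean token count of successful proofs. The `TaskResult` fields and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    theorem: str   # name of the theorem family this task belongs to (hypothetical field)
    is_main: bool  # True for the main theorem, False for a supporting subtheorem
    proved: bool   # whether the prover closed the goal
    tokens: int    # tokens spent on the attempt


def family_coverage(results):
    """Assumed metric: fraction of tasks proved in each theorem family."""
    totals, solved = {}, {}
    for r in results:
        totals[r.theorem] = totals.get(r.theorem, 0) + 1
        solved[r.theorem] = solved.get(r.theorem, 0) + int(r.proved)
    return {name: solved[name] / totals[name] for name in totals}


def mean_tokens_per_proof(results):
    """Assumed metric: average token count over successful proofs only."""
    proved = [r.tokens for r in results if r.proved]
    return sum(proved) / len(proved) if proved else None


# Illustrative data only; these numbers are not from the paper.
results = [
    TaskResult("pythagoras", True, False, 900),
    TaskResult("pythagoras", False, True, 100),
    TaskResult("pythagoras", False, True, 200),
    TaskResult("pythagoras", False, True, 300),
]
coverage = family_coverage(results)
efficiency = mean_tokens_per_proof(results)
```

In this toy family the main theorem fails while three subtheorems succeed, the kind of partial progress that a pass/fail score on the main theorem alone would hide.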

2606.09449 2026-06-09 cs.CL New submission

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

Lei Xu, Xin Quan, André Freitas

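Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne (EPFL); University of Manchester; CRUK National Biomarker Centre, University of Manchester

AI summary: Proposes a reference-free proxy-judge framework that replaces gold-standard matching with multi-axis property checks, enabling iterative refinement of autoformalization with a theoretical convergence guarantee and improved pass rates in experiments.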

Abstract

Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
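
Editor's sketch: the claim that the expected gap "contracts geometrically to a noise-dependent plateau" has the shape of a standard contraction-with-noise bound. The form below is generic and not the paper's theorem; the symbols g_t (intrinsic gap after t refinement rounds), rho (contraction rate), and epsilon (per-round noise term) are assumptions introduced for illustration.

```latex
% Assumed one-step inequality, with 0 \le \rho < 1 and \varepsilon \ge 0:
\mathbb{E}[g_{t+1}] \le \rho \, \mathbb{E}[g_t] + \varepsilon

% Unrolling the recursion over t rounds:
\mathbb{E}[g_t] \le \rho^{t} g_0 + \varepsilon \sum_{k=0}^{t-1} \rho^{k}
              = \rho^{t} g_0 + \varepsilon \, \frac{1 - \rho^{t}}{1 - \rho}
              \le \rho^{t} g_0 + \frac{\varepsilon}{1 - \rho}
```

The first term decays geometrically, and the second is the plateau, which vanishes only when the noise term does.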

2606.09447 2026-06-09 cs.AI New submission

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang

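Affiliations: Alibaba Cloud China

AI summary: Proposes the AliyunConsoleAgent framework, which applies supervised fine-tuning on distilled frontier-model trajectories and then reinforcement learning in real cloud environments with GRPO and a dual-channel outcome reward model, automating documentation verification and approaching the success rate of frontier proprietary models at low cost.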

Abstract

We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.
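
Editor's sketch: the core of Group Relative Policy Optimization is that each rollout's reward is normalized against the other rollouts sampled for the same task, so no learned value function is needed. The Python below shows only that normalization step under stated assumptions: the function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, and the reward values are made up rather than taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeeded (reward 1.0), two failed (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward carries no learning signal.
flat = group_relative_advantages([0.5, 0.5])
```

Successful rollouts receive positive advantages and failed ones negative, while a uniform group yields zeros. The reported figures are also internally consistent: 65.34 - 63.52 = 1.82 percentage points, and the bootstrap interval [-1.27, 7.39] contains zero.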

2606.09446 2026-06-09 cs.CV New submission

Leveraging Morphology for Historical Script Metrological Analysis

Malamatenia Vlachou Efstathiou, Raphaël Baena, Dominique Stutzmann, Mathieu Aubry

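Affiliations: LIGM, École des Ponts et Chaussées, IP Paris, CNRS, France; Institut de Recherche et d’Histoire des Textes, Paris, Île-de-France, France

AI summary: Proposes a Transformer-based detection architecture and a prototype-based line reconstruction module that learn character prototypes from line-level transcriptions, enabling scalable, meaningful paleographic measurements, and validates their usefulness for distinguishing graphical profiles and discovering subtle variations.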

Abstract

Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning. Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages. Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.

2606.09441 2026-06-09 cs.AI cs.AR New submission

SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

Rya Sanovar, Srikant Bharadwaj, Hritvik Taneja, Moinuddin Qureshi

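Affiliations: Georgia Institute of Technology; Microsoft

AI summary: To address the redundant prefill computation and increased TTFT caused by documents recurring across RAG queries, proposes SIFT, which extracts the positions of high attention scores in each document offline and exploits attention invariance to compute attention only at the marked positions during prefill, improving TTFT by 1.71x with accuracy within 1% of full recomputation.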

Abstract

Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prompt length and slows time to first token (TTFT). Unlike standard queries, RAG queries have a unique property of context reuse where the same documents recur across user queries. Thus, fully recomputing documents for every RAG query performs redundant computation and increases TTFT. Prior work precomputes KV tensors of RAG documents offline and coarsely recomputes some tokens during online prefill. However, such KV reuse is often slower than full recomputation on modern GPUs due to high-latency disk transfers. Further, such coarse-grained recomputation degrades accuracy. To address these limitations, this paper proposes SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance. SIFT processes documents offline and extracts fine-grained locations of high attention scores for each document. Next, we identify the following attention invariance insights that enable us to exploit the extracted locations during runtime: (1) Local-Attention Invariance: The locations of high attention scores within a document remain invariant to surrounding documents. This helps us predict the location of high scores where the document attends to itself. (2) Cross-Attention Consistency: Keys with high intra-document attention also attract cross-attention from subsequent documents. This helps us predict the location of high scores where the document attends to future documents. Critically, SIFT stores no KV data and only stores locations of high scores in the form of two compact bit vectors. SIFT's storage is up to 24,000x smaller than KV tensors, obviating costly disk transfers. During prefill, SIFT computes the attention only for the marked locations and improves TTFT by 1.71x while holding accuracy within 1% of full recompute.
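
Editor's sketch: the idea of computing attention "only for the marked locations" can be illustrated with a single query vector and a bit vector over key positions. This is a toy in pure Python, not the SIFT kernel: how the paper builds and combines its two bit vectors is not described in the abstract, and the function name and data layout here are hypothetical.

```python
import math


def masked_attention(query, keys, values, marked):
    """Softmax attention over only the key positions whose bit is set in `marked`."""
    idx = [i for i, bit in enumerate(marked) if bit]
    if not idx:
        raise ValueError("at least one position must be marked")
    scale = 1.0 / math.sqrt(len(query))
    # Dot products are computed for marked positions only; the rest are skipped.
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in idx]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(dim)]


# Three key positions; the bit vector marks positions 0 and 2.
output = masked_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]],
    marked=[1, 0, 1],
)
```

Both marked keys score equally against the query, so the output is the average of their values, and position 1 contributes no compute at all. The stored index costs one bit per position in each vector, which is the source of the storage advantage over keeping KV tensors.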

2606.09435 2026-06-09 cs.CL New submission

MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

David Setiawan, Temuulen Khishigsuren, Milind Agarwal, Pagnarith Pit, Aso Mahmudi, Ekaterina Vylomova

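Affiliations: School of Computing and Information Systems, The University of Melbourne; Melbourne School of Psychological Sciences, The University of Melbourne; LILT

AI summary: Proposes MUDIDI, a two-stage framework that uses language models to digitize multilingual dictionaries, outperforms existing OCR systems and vision-language models on character recognition, markup preservation, and entry segmentation, and releases an annotated dataset drawn from 30 public-domain dictionaries.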

Comments 9 pages, preprint, submitted to EMNLP 2026

Abstract

Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format were nearly impossible due to language-specific scripts and complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters and markup and how well they handle lexicographic structure. We introduce MUDIDI, a two-stage framework for multilingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL's Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplying the LLMs with additional information, such as the dictionary's introduction, can improve the quality of the digitized dictionary. GitHub: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/

2606.09434 2026-06-09 cs.LG New submission

Operator learning for solving Fokker-Planck equations with various initial conditions

Li Zeng, Xiaoliang Wan, Yaobin Wang, Fabio Nobile, Tao Zhou

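Affiliations: Fuzhou University; Louisiana State University; Beijing Normal-Hong Kong Baptist University; École Polytechnique Fédérale de Lausanne; Chinese Academy of Sciences

AI summary: Proposes a physics-informed neural network framework based on conditional normalizing flows that uses the Chapman-Kolmogorov equation and a linearized-SDE base distribution to efficiently approximate the FPE solution operator over a range of initial conditions, and introduces a time-weighted loss function to address small-time instabilities.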

Abstract

The Fokker-Planck equation (FPE) plays a pivotal role in describing the time evolution of probability density functions (PDFs) for systems governed by stochastic dynamics. In this work, we propose a conditional normalizing flow-based physics-informed neural network (PINN) framework for efficiently approximating the solution operator of the FPE for a whole range of initial conditions. Leveraging the Chapman-Kolmogorov equation for Markovian stochastic processes, the problem is reformulated into approximating a transition PDF starting at initial time from a Dirac mass centered at an arbitrary point. The PDF of an associated linearized stochastic differential equation (SDE) is employed as the base distribution for the normalizing flow, providing a good approximation of the target PDF, especially for small times, and thereby avoiding the singularity of the map associated with the Dirac delta initial distribution. Furthermore, a time-weighted loss function is introduced to mitigate numerical instabilities arising at small times, achieving a balance between causality and training difficulty as time progresses. A variety of numerical experiments are presented to illustrate the effectiveness and robustness of the proposed method.
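
Editor's sketch: the reformulation described above rests on two textbook identities, written here in notation that is assumed rather than taken from the paper (drift f, diffusion matrix sigma, initial density p_0). Learning the transition density for every Dirac starting point y yields the solution for any initial condition by integration.

```latex
% Assumed SDE: dX_t = f(X_t, t)\,dt + \sigma(X_t, t)\,dW_t, with D = \tfrac{1}{2} \sigma \sigma^{\top}.
% Fokker-Planck equation for the density p(x, t) of X_t:
\frac{\partial p}{\partial t}(x, t)
  = -\nabla \cdot \big( f(x, t)\, p(x, t) \big)
  + \sum_{i,j} \frac{\partial^2}{\partial x_i \, \partial x_j} \big( D_{ij}(x, t)\, p(x, t) \big),
  \qquad p(x, 0) = p_0(x)

% Chapman-Kolmogorov (Markov property): the solution for an arbitrary p_0
% is the transition density integrated against p_0.
p(x, t) = \int p(x, t \mid y, 0)\, p_0(y)\, dy,
  \qquad p(x, 0 \mid y, 0) = \delta(x - y)
```

The Dirac initial condition in the second line is the singularity that the linearized-SDE base distribution is meant to absorb at small times.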

2606.09433 2026-06-09 cs.AI New submission

Bayesian Selective Latent Inference for Wastewater-First Influenza Monitoring

Yixuan Zhang, Yang Song, Hao Wang, Samir Bhatt, Hengguan Huang

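Affiliations: University of Copenhagen; Rutgers University; Imperial College London

AI summary: Proposes Bayesian Selective Latent Inference (BSLI), which uses a posterior distribution, answerability certification, and a cost-calibrated Bellman policy to optimize query and abstention decisions in wastewater-first influenza monitoring.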

Comments Corresponding authors: Hengguan Huang and Samir Bhatt. Hengguan Huang is the lead corresponding author

Abstract

Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifiable proxy for human burden. Existing wastewater models assume a fixed evidence set, while generic evidence-acquisition methods treat official surveillance streams as interchangeable costly features. We cast wastewater-first influenza monitoring as a selective decision problem: starting from mandatory wastewater evidence, the system must decide whether wastewater is sufficient, which delayed official stream to query next, and when abstention is the only scientifically defensible action under source ambiguity. We propose Bayesian Selective Latent Inference (BSLI), a principled Bayesian method that maintains a posterior over latent burden and identifiability, certifies answerability through explicit scientific gates, and optimizes query-stop decisions with an exact cost-calibrated Bellman policy. We prove the key variational, answerability, Bellman-optimality, and one-dimensional cost-calibration properties. On a fixed public-data benchmark with 5,933 forecasting episodes and 3,102 source-ambiguity episodes, BSLI improves the matched-budget cost-performance frontier while preserving conservative abstention under source ambiguity.

2606.09432 2026-06-09 cs.LG New submission

Graph Mamba Operator: A Latent Simulator for Interacting Particle Systems

Karn Tiwari, Niladri Dutta, N M Anoop Krishnan, Prathosh A P

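Affiliations: Indian Institute of Science, Bangalore; Indian Institute of Technology, Delhi

AI summary: Proposes the Graph Mamba Operator (GraMO), which integrates state-space models and graph-based interaction learning in a single recurrence to jointly model long-range spatiotemporal dependencies, achieving the lowest error on N-body, motion capture, and robotics datasets.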

Comments Under Submission

Abstract

Modeling interacting dynamical systems requires capturing spatial interactions alongside long-range temporal dependencies. Graph neural networks (GNNs) provide a natural representation but typically rely on autoregressive rollouts and treat spatial and temporal dynamics separately, leading to error accumulation over long horizons. Existing approaches also focus on local interactions and short temporal contexts, limiting their ability to capture multi-hop dependencies and global structure. We introduce the Graph Mamba Operator (GraMO), a latent-space simulator that integrates state-space models with graph-based interaction learning. In contrast to prior work that sequences nodes or applies spatial and temporal updates in separate stages, GraMO couples graph-based interactions and temporal state updates within a single recurrence. The update is linear in the latent state, with input-dependent coefficients that adapt across regimes. We evaluate GraMO on N-body systems, motion capture, and robotics datasets, achieving the lowest error across benchmarks and the largest gains in long-horizon prediction.
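
Editor's sketch: "linear in the latent state, with input-dependent coefficients" matches the general shape of a selective state-space recurrence. The form below is that generic shape only; the symbols h_t (latent state), u_t (input at step t), A, and B are assumed notation, and the abstract does not say how the graph-based interactions enter the coefficients.

```latex
% Generic selective state-space update (assumed notation, not the paper's equations):
h_t = A(u_t)\, h_{t-1} + B(u_t)\, u_t

% For a fixed input sequence the map h_{t-1} \mapsto h_t is linear,
% while A(\cdot) and B(\cdot) change with the input, which lets the
% dynamics adapt across regimes.
```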

2606.09430 2026-06-09 cs.LG cs.AI New submission

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

Mingqi Yuan, Xiaoquan Sun, Shihao Luo, Jiayu Chen

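Affiliations: HKU; Qicore Tech

AI summary: Proposes LargeMonitor, a framework that uses large pretrained models (LVMs and LMMs) to decouple detection from diagnosis, providing zero-shot drift detection and semantic diagnosis of drift causes in task-free continual learning and improving the performance of existing algorithms.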

Abstract

Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.

2606.09428 2026-06-09 cs.CL New submission

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

Giacomo Gonella, Stefano Menini, Marco Guerini

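Affiliations: Fondazione Bruno Kessler; University of Trento

AI summary: Proposes a benchmarking framework that evaluates vision-language models guiding civilians through simulated evacuations across communication strategies (narrowcast vs. broadcast), environment representations (visual vs. graph-based), and threat behaviors (static vs. moving); finds that narrowcast lowers failure rates, the visual representation drives performance, and moving threats raise failure rates.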

Abstract

Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.

2606.09424 2026-06-09 cs.CL New submission

Toward Signing Activity Projection in Sign Language Interaction

Takao Obi, Wang Yusong, Koji Inoue, Kotaro Funakoshi

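Affiliations: Institute of Science Tokyo; Kyoto University

AI summary: Explores transferring the Voice Activity Projection (VAP) framework to dyadic sign language interaction, deriving signing activity streams from the Public DGS Corpus and predicting turn-taking from pose-based features; results show that SHIFT/HOLD prediction is promising but SHIFT prediction remains difficult.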

Abstract

Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT-prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.
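
Editor's sketch: deriving a binary signing activity stream from lexical sign annotations amounts to rasterizing time intervals onto fixed-length frames. The Python below assumes annotations arrive as (start, end) pairs in milliseconds and uses a 20 ms frame; both choices, along with the function name, are hypothetical and not taken from the paper or from the corpus format.

```python
def activity_stream(intervals_ms, duration_ms, frame_ms=20):
    """Return one 0/1 value per frame; 1 if any annotation overlaps the frame."""
    n_frames = -(-duration_ms // frame_ms)  # ceiling division
    stream = [0] * n_frames
    for start, end in intervals_ms:
        first = max(0, start // frame_ms)          # frame containing the start
        last = min(n_frames, -(-end // frame_ms))  # first frame after the end
        for frame in range(first, last):
            stream[frame] = 1
    return stream


# One signer, one annotated sign from 40 ms to 100 ms in a 200 ms recording.
signer_a = activity_stream([(40, 100)], duration_ms=200)
```

Running this once per signer gives the two parallel streams that a VAP-style model takes as its prediction target. Integer milliseconds are used to avoid floating-point rounding at frame boundaries.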

2606.09416 2026-06-09 cs.RO cs.AI cs.SE New submission

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park

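Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST)

AI summary: Argues that robot middleware is the harness layer for Physical AI, that it must mediate control, computing, and communication simultaneously, and that it needs three missing enforcement functions (Projection, Isolation, and Transfer), illustrated with a ROS 2 Harness Profile.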

Comments 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)

Abstract

Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.

2606.09409 2026-06-09 cs.AI cs.CL cs.LG New submission

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

正确看起来更好:成对比较揭示准确性排名

Mina Remeli, Moritz Hardt

Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Tübingen AI Center

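AI summary: By converting benchmarks into generative evaluations, the paper finds that model rankings from pairwise comparisons aggregated with Elo agree closely with ground-truth accuracy rankings (Spearman correlation > 0.9) and that style and judge bias have only minor effects, while answer repetition (echo) is a causal driver of judge preference.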

Comments Accepted at ICML'26

AI-generated Chinese abstract

成对比较结合诸如Elo等聚合方法已成为评估生成模型的核心,但人们仍担心它们会奖励肤浅的风格线索或显示裁判偏见。从更积极的角度看,我们表明,当存在真实准确率用于比较时,成对比较得出的模型排名与基于真实准确率的排名高度一致。通过将五个知名基准测试转化为自由形式的生成评估,我们发现Elo排名与准确率排名的Spearman相关系数超过0.9,并且在裁判较弱时显著优于直接评估。此外,风格和裁判偏见对模型排名的影响较小,尽管大多数判断发生在两个候选答案都正确(或都错误)的成对上。在这样的成对比较中,我们发现最终答案后的重复(echo)是裁判偏好的因果驱动因素。

English abstract

Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
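
As a rough illustration of the comparison described in this abstract, the sketch below fits Elo ratings from pairwise judgments and measures how well the resulting ranking agrees with an accuracy ranking using Spearman correlation. The function names, the K-factor, the number of passes, and the `(winner, loser)` data format are assumptions made for illustration; the abstract does not specify the paper's Elo variant or judge setup.

```python
import random


def elo_ratings(matches, k=8.0, base=1000.0, epochs=50, seed=0):
    """Fit Elo ratings from a list of (winner, loser) judgments.

    Several shuffled passes over the same judgments are used so the
    result depends less on the order in which judgments arrive
    (an assumption of this sketch, not a detail from the paper).
    """
    rng = random.Random(seed)
    ratings = {}
    for winner, loser in matches:
        ratings.setdefault(winner, base)
        ratings.setdefault(loser, base)
    order = list(matches)
    for _ in range(epochs):
        rng.shuffle(order)
        for winner, loser in order:
            # Expected score of the winner under the logistic Elo model.
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            delta = k * (1.0 - expected)
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


def ranks(scores):
    """Map each name to its rank (1 = highest score); assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two tie-free score dictionaries."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[name] - rb[name]) ** 2 for name in ra)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Calling `spearman(elo_ratings(judgments), accuracy_by_model)` on a set of judge decisions and per-model accuracies gives the kind of rank agreement the abstract reports; the tie-free rank formula is a simplification that would need adjusting if two models share a score.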

2606.09403 2026-06-09 cs.CL New submission

Introducing multiplex semantic networks as multifaceted representations of creative associative knowledge across multilingual samples

引入多重语义网络作为跨语言样本中创造性联想知识的多面表示

Edith Haim, Kurt Haim, Roger E. Beaty, Cynthia S. Q. Siew, Massimo Stella

Affiliations: CogNosco Lab, Department of Psychology and Cognitive Science, University of Trento; Department of Science Education, University of Education Upper Austria; Department of Psychology, The Pennsylvania State University; Department of Psychology, National University of Singapore

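AI summary: Builds multiplex semantic networks from six cognitive tasks to model the associative knowledge underlying creativity more comprehensively, and uses machine learning to predict individual creativity scores, showing that multiplex networks are more effective than any single task.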

AI-generated Chinese abstract

创造力是一种复杂的认知能力,依赖于语义记忆中的知识组织和检索。然而,大多数研究使用单一任务来测量它,仅捕捉了这种复杂性的一小部分。本研究调查了多重网络——从六种认知任务中获得的层次化语义网络——作为建模创造力背后联想知识的更全面方法。我们收集了来自四个国家(奥地利、美国、新加坡、意大利)的N=518名个体的数据。根据他们对言语流畅性、句子链、自由联想和叙事写作任务的回答,我们构建了语义网络并将其组装成多重结构。基于AI角色的响应提供了比较基线。结构可约性分析表明,不同的任务层捕捉了关于语义组织的不同、非冗余信息,支持使用多个任务而非任何单一任务。高创造力和低创造力组的网络在结构上保持不同,而AI生成的网络无论创造力组如何都显示出几乎相同的结构。最后,我们在使用岭回归的机器学习模型中使用了12个特征(网络度量、情感分数和扩散激活模拟)来预测个体创造力得分。在前一阶段识别出的结构相似层的组合,将概念验证预测准确性提高了50%。结构度量显示出最高的特征重要性,扩散激活动力学提供了额外的预测能力。总之,这些发现表明多重语义网络捕捉了创造力背后联想知识的更丰富的跨文化图景。我们还发布了我们的多样化数据集和代码,以促进创造力社区内的多样化计算方法。

English abstract

Creativity is a complex cognitive ability that relies on knowledge organisation and retrieval from semantic memory. Yet most research uses a single task to measure it, capturing only a fraction of this complexity. This study investigates multiplex networks - layered semantic networks obtained from six cognitive tasks - as a more comprehensive approach to modelling the associative knowledge underlying creativity. We collected data from N=518 individuals from four countries (Austria, USA, Singapore, Italy). From their responses to verbal fluency, sentence-chain, free association, and narrative writing tasks, we constructed semantic networks and assembled them in a multiplex structure. AI persona-based responses provided a comparison baseline. Structural reducibility analyses showed that different task layers captured distinct, non-redundant information about semantic organisation, supporting the use of multiple tasks over any single one. The networks from high- and low-creative groups remained structurally distinct, while AI-generated networks showed near-identical structures regardless of creativity group. Finally, we used 12 features (network measures, emotional scores, and spreading activation simulations) in a machine learning model using ridge regression to predict individual creativity scores. The combination of structurally similar layers, as identified in the previous stage, improved a proof-of-concept prediction accuracy by 50%. Structural measures showed the highest feature importance, with spreading activation dynamics providing additional predictive power. Together, these findings indicate that multiplex semantic networks capture a richer, cross-cultural picture of associative knowledge underlying creativity. We also release our diverse dataset and code to foster diverse computational approaches within the creativity community.

2606.09401 2026-06-09 cs.LG cs.CR New submission

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

大语言模型适配的实证隐私保护基准测试

Bartłomiej Marek, Lorenzo Rossi, Vincent Hanke, Xun Wang, Michael Backes, Franziska Boenisch, Adam Dziedzic

Affiliations: CISPA Helmholtz Center for Information Security

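AI summary: By systematically varying the adaptation data distribution and applying robust membership inference and canary data extraction attacks, the paper evaluates the practical privacy risk of large language models adapted under differential privacy. Distribution shift strongly affects privacy vulnerability, and parameter-efficient fine-tuning methods such as LoRA give the best empirical protection on out-of-distribution data.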

Comments Accepted at ICLR 2026 (Oral)

AI-generated Chinese abstract

最近的工作应用差分隐私(DP)来适配大语言模型(LLMs)以用于敏感应用,提供了理论保证。然而,其实用有效性仍不明确,部分原因是LLM预训练中,与适配数据的重叠和相互依赖关系可能破坏隐私,尽管采用了DP。为了在实践中分析这一问题,我们使用最先进的攻击(如鲁棒成员推断和金丝雀数据提取)调查了DP适配下LLMs中的隐私风险。我们通过系统变化适配数据分布(从与预训练数据完全重叠,经过分布内(IID)情况,到完全分布外(OOD)示例)来对这些风险进行基准测试。此外,我们评估了不同适配方法和不同隐私机制对脆弱性的影响。我们的结果表明,分布偏移强烈影响隐私脆弱性:在相同的理论保证下,适配数据越接近预训练分布,实际隐私风险越高,即使没有直接的数据重叠。我们发现,参数高效微调方法(如LoRA)对OOD数据实现了最高的实证隐私保护。我们的基准测试确定了在DP LLM适配中实现实际隐私的关键因素,为在敏感环境中部署定制模型提供了可操作的见解。展望未来,我们提出了一个结构化框架,用于超越适配隐私的整体隐私评估,以识别和评估LLM的完整预训练-适配流程中的风险。

English abstract

Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.

2606.09400 2026-06-09 cs.CV New submission

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

vesselFM-CT:在CT图像中分割所有血管以实现系统级心血管分析

Bastian Wittmann, Chinmay Prabhakar, Suprosanna Shit, Bjoern Menze

Affiliations: Department of Quantitative Biomedicine, University of Zurich

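AI summary: Proposes vesselFM-CT, which uses iterative multi-step training and the TubeLoss loss function to segment all vessels in CT images, from the largest vessels down to minuscule mesenteric vessels; it outperforms the baselines and supports system-level cardiovascular analysis.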

AI-generated Chinese abstract

人体血管网络中的血管在半径、长度、拓扑特性和分支模式上表现出剧烈的结构变化。这种异质性,加上位置特定的解剖背景变化,对稳健、大规模地分析整个心血管系统构成了重大挑战。因此,大多数研究集中在血管网络的狭窄孤立部分。虽然这些针对性研究提供了有价值的见解,但它们本质上限制了评估血管网络整体系统健康和功能完整性的能力。在这项工作中,我们旨在弥合这一差距,以推进临床诊断和我们对血管生理学的基本理解。我们提出了在CT图像中分割所有血管的任务,范围从心血管系统最大的组成部分到微小的肠系膜血管。为此,我们引入了vesselFM-CT,这是第一个能够稳健分割3D CT图像中所有血管的模型。vesselFM-CT通过迭代多步过程进行训练,并优化我们提出的TubeLoss损失函数,有效解决了心血管系统固有的异质性。我们证明vesselFM-CT优于所有基线,并能够从CT图像中自动精确提取心血管系统,从而解锁广泛的临床和技术视角,包括自动疾病分类和合成CT图像生成。

English abstract

The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variations in radius, length, topological properties, and branching patterns. This heterogeneity, together with location-specific anatomical background variations, poses a significant challenge for robust, large-scale analysis of the entire cardiovascular system. As a result, most research has focused on narrow, isolated segments of the vascular network. While such targeted studies provide valuable insights, they inherently limit the ability to assess the systemic health and functional integrity of the vascular network as a whole. In this work, we aim to bridge this gap to advance both clinical diagnostics and our fundamental understanding of vascular physiology. We propose the task of segmenting all vessels in CT images, ranging from the largest components of the cardiovascular system to even minuscule mesenteric vessels. To this end, we introduce vesselFM-CT, the first model capable of robustly segmenting all blood vessels in 3D CT images. VesselFM-CT is trained via an iterative, multi-step process and optimizes our proposed TubeLoss loss function, effectively addressing the inherent heterogeneity of the cardiovascular system. We demonstrate that vesselFM-CT outperforms all baselines and enables automated, precise extraction of the cardiovascular system from CT images, thereby unlocking a wide range of clinical and technical perspectives, including automated disease classification and synthetic CT image generation.

2606.09399 2026-06-09 cs.AI New submission

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

RunAgent SuperBrowser: 基于人类浏览行为的自主网页导航理论

Radeen Mostafa, Sawradip Saha

Affiliations: RunAgent AI

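AI summary: Proposes SuperBrowser, an autonomous web-navigation agent that imitates the perception-cognition-action triad of human browsing and reaches an 89.47% success rate on the Mind2Web Hard benchmark, ahead of existing open/research agents.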

Comments 31 pages, 8 figures, preprint/work in progress

AI-generated Chinese abstract

我们提出SUPERBROWSER,一个自主网页导航代理,其设计基于一个指导性假设:网页代理应该像人一样浏览。人类阅读页面时不会记住看到的每个像素;他们会看几个候选目标,决定一个,并只记住维持目标所需的信息。我们将这个感知-认知-行动三元组实现为三个耦合机制。首先,一个视觉优先的边界框管道在每个截图上标记候选交互区域,并异步预取给语言模型,使“眼睛”先于“手”。其次,一个三角色大脑——一个分类和路由的编排器、一个每几步评估进度的规划器、一个发出每步动作的工作器——将战略推理与操作推理分离。第三,一个结构化的账本只存储人类会记住的内容:目标、最近三个动作、少量事实和死胡同、以及少量检查点;一个六阶段驱逐循环系统性地从实时上下文中丢弃过时的截图、状态块和推理痕迹。动作执行是一个三层点击级联(Chrome DevTools协议到Puppeteer到脚本化),带有拟人化的贝塞尔运动,以及一个感知V形箭头的边界框捕捉器,解决“大标签旁的小箭头”歧义。在Mind2Web Hard基准(66个任务)上,SUPERBROWSER达到89.47%的成功率,总体排名第三,并以大幅优势领先所有已发表的开源/研究浏览器代理基线。我们认为,这一提升并非来自任何单一技巧,而是来自整个系统中认知契约的一致应用。

English abstract

We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.

2606.09396 2026-06-09 cs.CL cs.LG New submission

PriFT: Prior-Support Guided Supervised Fine-Tuning

PriFT: 先验支持引导的监督微调

Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard

Affiliations: EPFL

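AI summary: Proposes PriFT, which computes token weights from a frozen pretrained model to avoid the self-reinforcing dynamics caused by using the online model; it achieves the best results among SFT methods on mathematical reasoning, code generation, and medical question answering, and provides a better initialization for subsequent RL.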

Comments The first two authors contributed equally to this work

AI-generated Chinese abstract

监督微调(SFT)是下游任务适配的高效方法,通常作为强化学习(RL)的初始化阶段,但其泛化能力可能弱于RL。一个关键限制是其离策略目标:SFT逐token拟合固定演示,包括与模型预训练分布对齐不良的目标,这可能导致过拟合。最近一系列工作通过给与当前模型预测分布更对齐的token分配更大的训练权重来解决此问题,直觉是拟合这些token对模型的预训练知识和表示的扭曲较小。然而,从当前微调模型计算token权重会将token权重与优化轨迹纠缠在一起,随着分布迅速偏离预训练模型,引发自我强化动态。为了解决这个问题,我们提出PriFT(先验支持引导的微调),该方法从冻结的预训练参考模型导出token权重,以获得不受微调影响的稳定重加权信号。该信号估计先验支持:每个目标token受预训练分布支持的程度。在多种现有token重加权规则中,将重加权信号从在线模型替换为预训练模型一致地提升了性能。我们引入了两种实例化:PriFT-prob使用预训练token概率,而PriFT-mass根据预训练分布下的累积概率质量选择token。在数学推理、代码生成和医疗问答上的大量实验表明,PriFT在SFT基线中取得了最先进的结果,并为后续RL训练提供了更好的初始化。

English abstract

Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens is less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently being fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, switching the source of the reweighting signal from the online model to the pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
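
To make the idea of prior-support reweighting concrete, here is a minimal sketch of a token-weighted SFT loss in which the weights come from a frozen reference distribution instead of the model being trained. Everything specific here is an assumption: the function names, the normalization by total weight, and in particular the reading of PriFT-mass as a top-p (nucleus) membership test are illustrative guesses, since the abstract does not give the exact formulas.

```python
import math


def prift_prob_weights(ref_probs, targets):
    """Assumed PriFT-prob reading: weight = reference probability of the target token."""
    return [dist[t] for dist, t in zip(ref_probs, targets)]


def prift_mass_mask(ref_probs, targets, top_p=0.9):
    """Assumed PriFT-mass reading: keep a target token only if it lies inside
    the smallest set of tokens whose reference probability mass reaches top_p."""
    mask = []
    for dist, t in zip(ref_probs, targets):
        order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
        mass, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            mass += dist[i]
            if mass >= top_p:
                break
        mask.append(1.0 if t in nucleus else 0.0)
    return mask


def weighted_nll(model_probs, targets, weights):
    """Token-weighted negative log-likelihood under the model being fine-tuned.

    The weights are treated as constants: they depend only on the frozen
    reference, so they do not drift as the fine-tuned model changes.
    """
    total = sum(weights)
    if total == 0.0:
        return 0.0
    loss = sum(-w * math.log(dist[t]) for dist, t, w in zip(model_probs, targets, weights))
    return loss / total
```

Each entry of `ref_probs` and `model_probs` is a per-position probability vector over the vocabulary. In a real training loop these would come from two forward passes, with no gradient flowing through the reference model.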

2606.09393 2026-06-09 cs.CV New submission

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

CapRL++:基于可验证奖励的统一强化学习用于密集图像和视频描述生成

Penghui Yang, Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Yibin Wang, Yujie Zhou, Jiazi Bu, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin

Affiliations: Tsinghua University; University of Science and Technology of China; Microsoft; Shanghai AI Laboratory; Shanghai Innovation Institute; Alibaba Cloud; The Chinese University of Hong Kong

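AI summary: Proposes the CapRL++ framework, which optimizes multimodal captioning with reinforcement learning with verifiable rewards (RLVR), using the accuracy of a non-visual language model at answering questions as the reward; it improves dense caption quality and surpasses conventional supervised fine-tuning on more than 20 benchmarks.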

Comments 26 pages, 10 figures. Project page: https://github.com/InternLM/CapRL. arXiv admin note: text overlap with arXiv:2509.22647

AI-generated Chinese abstract

图像和视频描述是连接视觉与语言领域的基础任务,在预训练大型视觉语言模型(LVLMs)中发挥关键作用。当前最先进的描述模型通常采用监督微调(SFT)训练,这种范式依赖于昂贵且不可扩展的标注,并常导致模型记忆特定真实答案,限制了其通用性和生成多样化、创造性描述的能力。为克服这些局限,我们提出将可验证奖励的强化学习(RLVR)应用于多模态描述的开放任务。我们引入描述强化学习++(CapRL++),一种新颖的无参考训练框架,通过效用重新定义描述质量:高质量描述应使非视觉语言模型能够准确回答关于相应视觉内容的问题。CapRL++采用解耦的两阶段流程,其中LVLM生成描述,目标奖励来自一个独立的、无视觉的LLM仅基于该描述回答多项选择题的准确率。在超过20个图像和视频基准上的评估表明,CapRL++提升了密集描述质量,并增强了基于描述的预训练在空间和时间理解等任务上的表现。在CapRL++标注的可扩展图像和视频描述数据集上预训练带来了显著的下游收益。此外,在描述质量评估的Prism框架内,使用CapRL++训练的紧凑模型在密集描述性能上可与Qwen2.5-VL-72B和Qwen3-VL-235B-A22B等大得多的模型相媲美。这些结果验证了CapRL++能有效训练模型生成可泛化、高保真的描述,为超越传统SFT的局限奠定了坚实基础。

English abstract

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
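
The utility-based reward described in this abstract can be sketched as follows: a caption is scored by the fraction of multiple-choice questions that a text-only answerer gets right when it sees only the caption. The `MCQ` type, `caption_reward`, and the `keyword_answerer` stand-in are hypothetical names introduced for illustration; in the paper the answerer is a vision-free LLM, and the toy keyword matcher here only serves to make the sketch runnable.

```python
from typing import Callable, NamedTuple, Sequence


class MCQ(NamedTuple):
    question: str
    options: Sequence[str]
    answer_index: int


def caption_reward(
    caption: str,
    questions: Sequence[MCQ],
    answer_fn: Callable[[str, MCQ], int],
) -> float:
    """Reward = accuracy of a caption-only answerer on the question set."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if answer_fn(caption, q) == q.answer_index)
    return correct / len(questions)


def keyword_answerer(caption: str, q: MCQ) -> int:
    """Toy stand-in for the vision-free LLM (an assumption of this sketch):
    pick the first option mentioned in the caption, or -1 to abstain."""
    text = caption.lower()
    for i, option in enumerate(q.options):
        if option.lower() in text:
            return i
    return -1
```

Because the answerer never sees the image, a caption that omits a detail cannot earn credit for questions about that detail, which is what ties the reward to caption completeness.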

2606.09392 2026-06-09 cs.AI New submission

From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction

从粗到细:管理时空数据中的时间粒度以实现细粒度交通预测

Shuhao Li, Weidong Yang, Yue Cui, Zizhuo Xu, Lipeng Ma, Fan Zhang, Xiaofang Zhou

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University; Tongyi Lab, Alibaba Group; The Hong Kong University of Science and Technology; Guangzhou University

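AI summary: To address the problem that coarse-grained sampled data cannot readily support fine-grained prediction, proposes the Spatial-Temporal Refinement Predictor (STRP), which uses Tree Convolution and Inverse Dilated Convolution for efficient spatio-temporal modeling and significantly outperforms existing methods on six datasets.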

AI-generated Chinese abstract

高效的交通数据获取、存储和利用是时空数据管理中的关键挑战。大多数交通数据系统以固定的粗粒度时间间隔收集和存储观测数据,以降低存储和计算成本。然而,这种粗粒度数据严重限制了需要更细时间粒度预测的下游应用。在所有地点和时间段收集和维护细粒度交通数据将给数据库存储和预处理流程带来巨大负担。为了解决这种时间粒度不匹配问题,我们定义了一个新问题:利用粗粒度采样数据预测细粒度未来交通。我们提出了时空细化预测器(STRP),一种面向时空数据系统的粒度感知框架。STRP集成了两个组件:用于高效且可解释的空间依赖建模的树卷积,以及用于渐进式时间外推的逆膨胀卷积。STRP支持两种实用的预测设置:基于窗口和基于持续时间的,以处理不同形式的粒度不匹配。在六个基准数据集上的实验表明,STRP在准确性和效率上均显著优于最先进的基线方法。我们的工作为管理时空交通数据系统中的粒度不匹配提供了一种实用且可解释的方法。

English abstract

Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic data systems collect and store observations at fixed, coarse-grained temporal intervals to reduce storage and computation costs. However, such coarse-grained data severely limits downstream applications that require predictions at a finer temporal granularity. Collecting and maintaining fine-grained traffic data across all locations and time periods would impose a substantial burden on database storage and preprocessing pipelines. To address this temporal granularity mismatch, we formulate a novel problem: predicting fine-grained future traffic using coarse-grained sampled data. We propose the Spatial-Temporal Refinement Predictor (STRP), a granularity-aware framework for spatio-temporal data systems. STRP integrates two components: Tree Convolution for efficient and interpretable spatial dependency modeling, and Inverse Dilated Convolution for progressive temporal extrapolation. STRP supports two practical prediction settings: window-based and duration-based, to handle different forms of granularity mismatch. Experiments on six benchmark datasets show that STRP significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Our work offers a practical and interpretable approach to managing granularity mismatches in spatio-temporal traffic data systems.

2606.09390 2026-06-09 cs.CV cs.AI cs.RO New submission

Real-time body pose non-verbal communication with a consistency-based reliability measure

基于一致性可靠性度量的实时身体姿态非语言通信

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

Affiliations: National University of Science and Technology "Politehnica" Bucharest; Simion Stoilow Institute of Mathematics of the Romanian Academy; NORCE Norwegian Research Centre AS

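AI summary: Studies the recognition of communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and achieves real-time performance on an embedded GPU.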

AI-generated Chinese abstract

身体运动在远距离或无法捕捉面部及语音的条件下传达意图。我们研究仅从2D身体姿态识别通信意图。我们认为身体运动是可靠的信号,特别是在需要实时低成本设备上的人-机器人通信场景中,如救援任务。然而,现有资源并未孤立这一信号。情感语料库结合了身体、面部、语音和文本,而骨架动作识别基准标记的是执行的动作而非传达的信息。我们发布了一个包含十种通信意图的全身体姿态真实帧数据集,并将其与其他真实(IPC)和合成(MotionLCM, VEO3.1, Kimodo)数据集进行比较,这些数据集覆盖了不同难度。我们针对能在机器人有限板载硬件上运行的系统。我们基准测试了多种模型,从骨架图分类器到联合运动预测网络,并在嵌入式GPU(NVIDIA Orin Nano)上报告了性能指标和帧率,因为在我们的场景中速度和准确性同样重要。最后,我们展示了模型自身的自回归自一致性可作为无监督可靠性信号。我们给出了一个简短证明,界定了自一致性预测正确的概率,表明该概率随一致步数增加而增长,并识别了自信预测仍可能错误的条件,与行业标准指标进行了基准测试。

English abstract

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.

2606.09389 2026-06-09 cs.CL New submission

LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks

LexRubric:面向开放式法律任务的基于评分指南的诊断基准

Yifan Chen, Haitao Li, Yiran Hu, Kaisong Song, Jun Lin, Yueyue Wu, Qingyao Ai, Min Zhang, Yiqun Liu

Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; University of Waterloo; Alibaba Group

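AI summary: Proposes the LexRubric benchmark, comprising 649 Chinese legal consultation and judicial examination instances and 12,337 atomic scoring criteria, and evaluates the reliability of LLMs on open-ended legal tasks under a six-dimensional framework, finding that current models still face challenges.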

AI-generated Chinese abstract

随着大型语言模型(LLM)越来越多地应用于现实法律任务,评估其开放式法律响应的可靠性变得至关重要。这些任务需要上下文敏感的答案,且容错空间极小,因此需要能够识别响应质量失败具体原因的细粒度诊断评估。我们引入了LexRubric,一个基于评分指南的基准,用于评估开放式中文法律任务。LexRubric包含来自法律咨询和司法考试的649个实例,这些实例既反映了日常法律需求,也体现了专业法律推理,覆盖14个法律场景。此外,它还包含12,337条由专家编写的原子评分标准,这些标准组织在一个统一的六维框架下,能够跨任务和评估维度进行准确的评估和诊断分析。为了验证评估的可靠性,我们测试了多个评判模型,并将基于模型的评判与人类评判进行了比较。我们进一步在LexRubric上评估了18个近期通用和法律领域的LLM。结果表明,不同模型展现出不同的能力特征,且开放式法律问题对当前LLM仍然具有挑战性。数据可在以下网址获取:https://github.com/foggpoy/LexRubric。

English abstract

As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained, diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal questions remain challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.

2606.09388 2026-06-09 cs.LG New submission

Distilling Safe LLM Systems via Soft Prompts for On Device Settings

通过软提示蒸馏安全的设备端LLM系统

Motasem Alfarra, Cristina Pinneri, Dana Kianfar, Mohammed Almousa, Christos Louizos

Affiliations: Qualcomm AI Research

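AI summary: To address the challenge of deploying safe large language models (LLMs) on resource-constrained devices, proposes a safety alignment method based on soft prompts and distillation training that achieves a superior safety-usefulness trade-off with minimal additional compute overhead.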

Comments Accepted to UAI 2026

Journal ref
42nd Conference on Uncertainty in Artificial Intelligence 2026
AI-generated Chinese abstract

在资源受限的边缘设备上部署安全的大语言模型(LLM)面临关键挑战:虽然将LLM与防护模型结合的双模型系统能提供有效的安全保障,但其巨大的内存和计算需求使其在设备端部署中代价高昂。本文对资源受限环境下的参数高效安全对齐方法进行了全面研究。通过对多种LLM架构、训练目标和参数高效微调方法的系统评估,我们发现软提示与基于蒸馏的训练相结合始终优于其他方法。我们引入了基于总变差和KL散度的蒸馏框架,能够有效将防护模型的安全行为迁移到学习到的软提示中。我们在多个基准上的评估表明,与LoRA适配器、引导向量和直接优化方法相比,这种组合在安全-有用性权衡上表现更优,同时在推理时仅需极少的额外内存和计算。这些发现确立了软提示蒸馏作为设备端LLM部署中安全对齐的首选方法。

English abstract

Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.

2606.09383 2026-06-09 cs.CV New submission

An Opticalmechanics Framework for Dynamic Estimation of Multibody Systems

多体系统动态估计的光力学框架

Banglei Guan, Xuanyu Bai, Qingquan Chen, Zibin Liu, Dongcai Tan, Zhenbao Yu, Yang Shang, Qifeng Yu

Affiliations: National University of Defense Technology

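AI summary: Proposes an opticalmechanics kinematic-dynamic integrated framework that combines image-measured kinematic quantities with genetic-algorithm optimization to estimate joint torque without contact; experiments show a wrist joint torque error of 0.46 Nm.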

Comments 10 pages, 12 figures

AI-generated Chinese abstract

传统的人体动力学分析通常受限于接触力/力矩传感器和受控实验室环境。为解决此问题,本研究提出了一种用于多体系统的光力学运动-动力学集成估计框架。具体而言,建立约束多体模型描述系统动力学,同时将图像测量的运动学量作为动态估计的非接触输入。然后通过基于遗传算法的优化,最小化模型预测与图像测量运动学量之间的差异,识别未知关节力矩。在气浮平台上的实验验证表明,与传感器测量相比,从图像数据估计的腕关节力矩平均绝对误差为0.46 Nm。在前向预测测试中,模型预测的角速度相对于图像测量结果的平均绝对误差为0.006 rad/s。本研究展示了在直接力/力矩测量困难的情况下,结合图像测量和力学建模进行非接触动态估计的潜力。

English abstract

Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque sensors and controlled laboratory environments. To address this issue, this study proposes an opticalmechanics kinematic-dynamic integrated estimation framework for multibody systems. Specifically, a constrained multibody model is established to describe the system dynamics, while image-measured kinematic quantities are used as non-contact inputs for dynamic estimation. The unknown joint torque is then identified through genetic-algorithm-based optimization that minimizes the discrepancy between model-predicted and image-measured kinematic quantities. Experimental validation on an air-bearing platform showed that the wrist joint torque estimated from image data achieved a mean absolute error of 0.46 Nm compared with sensor measurements. In the forward prediction test, the model-predicted angular velocity achieved a mean absolute error of 0.006 rad/s relative to the image-measured results. This study demonstrates the potential of combining image measurement and mechanical modeling for non-contact dynamic estimation in scenarios where direct force and torque measurement is difficult.

2606.09381 2026-06-09 cs.RO New submission

ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration

ReGIL: 基于检索引导的单一示范模仿学习

Yuying Zhang, Francesco Verdoja, Wenyan Yang, Ville Kyrki

Affiliations: Aalto University

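AI summary: Proposes the ReGIL framework, which treats a single demonstration as an external memory and uses retrieval to guide exploration, generate the regularization buffer, and construct rewards, significantly improving success rate and training efficiency on the LIBERO and Meta-World benchmarks and on real-robot tasks.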

AI-generated Chinese abstract

使用深度神经网络从单一示范中学习机器人操作策略仍然极具挑战性,因为即使与示范轨迹有微小偏差也可能迅速累积导致失败,而收集大量在线交互数据成本高昂。我们提出ReGIL,一种检索引导的模仿学习框架,将单一示范视为外部记忆。ReGIL在整个训练过程中反复查询该静态记忆,以同时引导探索、生成正则化缓冲和构建奖励。具体而言,它通过当前轨迹与检索片段之间的局部时间对齐来计算奖励,为策略改进提供逐步且信息丰富的反馈。我们在LIBERO和Meta-World基准的机器人操作任务上,在单一示范设置下评估了ReGIL。ReGIL在成功率和训练效率上均优于先前基线。在真实机器人实验中,仅使用一个示范和不到一小时的在线训练,ReGIL在三个操作任务上(初始机器人姿态和目标位置均随机)实现了超过75%的成功率。这些结果表明,将单一示范作为可重用记忆可以提供比静态监督更高效的机器人学习。更多详情请访问我们的网站:https://regil2026.github.io/

English abstract

Learning robot manipulation policies with deep neural networks from a single demonstration remains highly challenging, as even small deviations from the demonstrated trajectory can quickly compound into failure, while collecting substantial online interaction data is costly. We propose ReGIL, a retrieval-guided imitation learning framework that treats a single demonstration as an external memory. ReGIL repeatedly queries this static memory throughout training to simultaneously guide exploration, generate the regularization buffer, and construct rewards. Specifically, it computes rewards through local temporal alignment between the current trajectory and the retrieved segment, providing step-wise and informative feedback for policy improvement. We evaluate ReGIL on robotic manipulation tasks from the LIBERO and Meta-World benchmarks under the single demonstration setting. ReGIL outperforms prior baselines in both success rate and training efficiency. In real-robot experiments, using only one demonstration and less than one hour of online training, ReGIL achieves over 75% success rate across three manipulation tasks with randomness in both initial robot pose and target position. These results demonstrate that leveraging the single demonstration as reusable memory can provide more than static supervision for efficient robot learning. More details can be found on our website: https://regil2026.github.io/

2606.09380 2026-06-09 cs.LG cs.AI cs.CL New submission

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

推理竞技场:当可验证奖励不足时的轨迹锦标赛

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

Affiliations: University of Cambridge; Mistral AI

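AI summary: Proposes the Reasoning Arena framework, which uses trace tournaments to convert non-diverse reward groups that carry no gradient signal into relative reward signals and integrates them efficiently into reinforcement learning with a Bradley-Terry model, yielding an average gain of 7.6% on mathematics and coding benchmarks and accelerating training by 27% to 41%.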

Comments 9 pages, 6 figures, 2 tables (17 pages including references and appendices)

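Details
AI abstract (translated from Chinese)

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards often become uninformative at the group level: when all sampled traces for a given prompt receive the same reward, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond checking the final answer, Reasoning Arena builds trace tournaments in which reasoning traces are compared pairwise to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To keep reward estimation efficient, it does not compare every pair exhaustively; instead, each new trace is evaluated against a small, dynamically updated pool of previously generated traces that serve as anchors. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Experiments show that Reasoning Arena outperforms the RLVR baseline by 7.6% on average on competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saves nearly 50% of generation compute, and substantially improves overall reasoning performance.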

English abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
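The two mechanisms named in this abstract can be illustrated with a minimal sketch; it is not the authors' implementation. The first function shows why a group with identical rewards yields zero advantages under a mean-and-std normalization (a common form of group-relative advantage, assumed here). The second fits Bradley-Terry strengths from a sparse list of pairwise outcomes using the classic minorization-maximization update. The function names, the `prior` smoothing against a virtual opponent of strength 1, and the example comparisons are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one prompt's group of sampled traces."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fit_bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses minorization-maximization updates. `prior` (an assumption of this
    sketch) gives every item a fractional win and loss against a virtual
    opponent of fixed strength 1, which keeps strengths finite and positive
    even for items that never win or never lose.
    """
    wins = [prior] * n_items
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        # Contribution of the virtual opponent (2 * prior comparisons each).
        denom = [2.0 * prior / (s + 1.0) for s in strengths]
        # Contribution of each observed pair; unobserved pairs add nothing,
        # so the comparison graph may be incomplete.
        for (i, j), n in pair_counts.items():
            shared = n / (strengths[i] + strengths[j])
            denom[i] += shared
            denom[j] += shared
        strengths = [w / d for w, d in zip(wins, denom)]
    return strengths


# All four traces got the same verifiable reward: every advantage is zero.
flat_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical judge outcomes on 4 of the 6 possible pairs: (winner, loser).
sparse_outcomes = [(0, 1), (1, 2), (2, 3), (0, 2)]
strengths = fit_bradley_terry(4, sparse_outcomes)
ranking = sorted(range(4), key=lambda i: strengths[i], reverse=True)
```

In this toy case the fitted strengths order the traces even though only four pairs were judged, which is the property that lets a relative reward be assigned without comparing every pair.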

2606.09378 2026-06-09 cs.CV New submission

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Zhiwei Wang, Tao Huang, Wentao Jiang, Muyi Li, Jianxin Liu, Jian Chen, Jie Zou, Yong Luo, Bo Du, Jing Zhang

Affiliations * School of Computer Science, Wuhan University, China; The Central Hospital of Wuhan, China; School of Computer Science, Hubei University of Technology, China

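AI summary Proposes the Echo-DM framework, which combines conditional latent diffusion with region-aware fusion to remove artificial markers from ultrasound images without masks while preserving anatomical fidelity.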

Comments 18 pages, 4 figures

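Details
AI abstract (translated from Chinese)

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, that assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, in which a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, showing that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, show superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, along with favorable quality-efficiency trade-offs across deployment settings. Data, code, and models will be released at https://github.com/MiliLab/Echo-DM.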

English abstract

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, where a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, which demonstrates that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, demonstrate superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, while providing favorable quality-efficiency trade-offs across deployment settings. Data, code, and models will be released at https://github.com/MiliLab/Echo-DM.

2606.09371 2026-06-09 cs.AI New submission

Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

Haotong Yang, Ting Long, Yi Chang

Affiliations * Jilin University

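AI summary Proposes CAHL, which uses RLVR to jointly optimize a high-level planner and a low-level executor, addressing planner-executor misalignment in hierarchical tool learning; effectiveness is validated on multiple benchmarks.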

Comments 14 pages, 5 figures, 6 tables. Preprint

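Details
AI abstract (translated from Chinese)

Tool learning enables large language models to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, while a low-level policy focuses on invoking tools to solve those sub-tasks. However, these works typically optimize the high-level and low-level policies separately, which leads to planner-executor misalignment and limits LLM performance on tool-use tasks. This paper proposes a method called Capability-Aligned Hierarchical Learning (CAHL), which uses RLVR to jointly optimize both policies so that the high-level planner and the low-level executor are better aligned. Experiments on constrained tool-use benchmarks (API-Bank and BFCL) and an open-ended environment (Bamboogle) demonstrate the effectiveness of CAHL.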

English abstract

Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, and a low-level policy focuses on invoking tools to solve these sub-tasks. However, these works typically optimize the high-level and low-level policies separately, leading to planner-executor misalignment and limiting LLM performance on tool-use tasks. In this paper, we propose a method called Capability-Aligned Hierarchical Learning (CAHL), which leverages RLVR to jointly optimize both policies, enabling better alignment between the high-level planner and the low-level executor. Experiments on constrained tool-use benchmarks (API-Bank and BFCL) and an open-ended environment (Bamboogle) demonstrate the effectiveness of CAHL.

2606.09368 2026-06-09 cs.CV cs.AI New submission

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

Minghao Zou, Qingtian Zeng, Shangkun Liu, Yanda Meng, Guanghui Yue, Baoquan Zhao, Abdulmotaleb El Saddik, Wei Zhou

Affiliations * Cardiff University; Shandong University of Science and Technology; University of Exeter; Shenzhen University; Sun Yat-sen University; University of Ottawa

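AI summary Proposes PhysScene, the first scene graph dataset for physics experiments; through high-density relational constraints and structured experimental setups, it advances the modeling of logical dependencies beyond spatial co-occurrence in scientific visual reasoning.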

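Details
AI abstract (translated from Chinese)

Scene graphs provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets focus mainly on generic natural scenes, and domain-specific, function-oriented scenes remain underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, which in turn hinders the development of intelligent monitoring, analysis, and related applications in such scenes. To fill this gap, we introduce PhysScene, the first scene graph dataset tailored to physics experiments. PhysScene covers the specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvement. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.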

English abstract

Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.
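For readers unfamiliar with the representation, a scene graph as described above (objects plus pairwise relations) can be sketched as a small data structure. Everything here is hypothetical and not taken from PhysScene: the class names, the object labels, the predicates, and the spatial versus functional split are illustrative assumptions, and relation density is computed simply as relations per object.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str
    obj: str
    kind: str  # "spatial" or "functional" (a hypothetical split)


@dataclass
class SceneGraph:
    objects: list[str]
    relations: list[Relation] = field(default_factory=list)

    def relation_density(self) -> float:
        """Average number of relations per object (0.0 for an empty scene)."""
        return len(self.relations) / len(self.objects) if self.objects else 0.0

    def of_kind(self, kind: str) -> list[Relation]:
        """Return only the relations of the given kind."""
        return [r for r in self.relations if r.kind == kind]


# An invented circuit scene: one spatial relation and two functional ones.
circuit = SceneGraph(
    objects=["battery", "ammeter", "resistor"],
    relations=[
        Relation("ammeter", "left_of", "battery", "spatial"),
        Relation("battery", "powers", "resistor", "functional"),
        Relation("ammeter", "measures_current_through", "resistor", "functional"),
    ],
)

functional = circuit.of_kind("functional")
density = circuit.relation_density()
```

The functional relations are the kind that cannot be read off from positions alone, which is the distinction the abstract draws between spatial co-occurrence and logical dependency.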