arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.16022 2026-06-16 cs.RO New submission

$λ$-Reachability: Geometric-Horizon Safety Bellman Equations for Humanoid Safety

Rui Chen, Shangtao Li, Yifan Sun, Changliu Liu

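Affiliations: Robotics Institute, Carnegie Mellon University; Mechanical Engineering, Carnegie Mellon University

AI summary: Proposes $λ$-Reachability, which uses a stochastic multi-step estimator with a geometrically distributed horizon and a randomly absorbed terminal to make Hamilton-Jacobi safety analysis practical for high-dimensional robotic systems, significantly improving safe-set boundary classification and safety margin estimation.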

Abstract

We introduce $λ$-Reachability, a scalable approach to Hamilton--Jacobi safety analysis for high-dimensional robotic systems. Unlike prior discounted formulations that rely on fixed one-step Bellman updates, $λ$-Reachability employs a stochastic multi-step estimator of the safety value, using a geometrically distributed rollout horizon together with a randomly absorbed terminal. Conceptually analogous to TD($λ$), $λ$-Reachability interpolates between local self-consistency updates and long-horizon max-over-trajectory safety targets via an interpretable horizon-control parameter. Unlike TD($λ$), where the terminal value is always incorporated in learning targets, the terminal safety value in $λ$-Reachability is used only with a probability controlled by the parameter $δ$. We formally show that for $δ<1$, the update induces a contraction mapping that allows temporal-difference learning; as $λ\to 1$, the estimator recovers the undiscounted reachability objective. We apply $λ$-Reachability to high-dimensional safety learning problems with both simulated and real humanoid robots under balance and collision avoidance constraints. Experimental results demonstrate that $λ$-Reachability significantly improves both safe-set boundary classification and safety margin estimation compared to single-step temporal-difference baselines.
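
The estimator described in this abstract can be illustrated with a rough sketch: sample a geometrically distributed rollout horizon, then include the terminal value only with probability $δ$. The function name, the argument layout, and the form of the target (a minimum over safety margins, whereas the abstract speaks of max-over-trajectory targets) are assumptions made for illustration; the abstract does not state the paper's actual update rule.

```python
import random

def lambda_reach_target(margins, values, lam, delta, rng=random):
    """Hypothetical multi-step safety target (illustrative, not the paper's code).

    margins[t]: safety margin observed at step t of a rollout (t = 0..T-1).
    values[t]:  current value estimate at step t (values has length T + 1).
    lam:        continuation probability of the geometric rollout horizon.
    delta:      probability of including the terminal value in the target.
    """
    T = len(margins)
    # Sample horizon H >= 1: keep extending with probability lam, capped at T.
    horizon = 1
    while horizon < T and rng.random() < lam:
        horizon += 1
    # Worst-case (minimum) margin seen along the truncated rollout.
    target = min(margins[:horizon])
    # Bootstrap from the terminal value only with probability delta.
    if rng.random() < delta:
        target = min(target, values[horizon])
    return target
```

With `lam = 0` and `delta = 1` this reduces to a one-step bootstrapped target; with `lam = 1` and `delta = 0` it is the minimum margin over the whole rollout.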

2606.16019 2026-06-16 cs.CL cs.LG cs.SD New submission

Scaling Human and G2P Supervision for Robust Phonetic Transcription

Alexander Metzger, Aruna Srivastava, Ruslan Mukhamedvaleev

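Affiliations: Koel Labs LLC

AI summary: Studies how automatic phonetic transcription scales with human annotation and G2P supervision; G2P helps when fewer than 20-30 hours of human annotation are available, brings no benefit and can even reduce robustness beyond that, while ASR pretraining substantially improves performance.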

Comments Accepted to Interspeech 2026

Abstract

Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We study how automatic phonetic transcription performance scales with human and G2P supervision in English. Using a curated 80-hour benchmark spanning native, non-native and post-stroke speech, we identify a supervision quality threshold: G2P supervision helps only when fewer than 20-30 hours of human annotation are available. Beyond this threshold, it provides no significant benefit and can reduce cross-dialect robustness. What is effective after this threshold is ASR pretraining, which we use to achieve a 2.3x reduction in weighted phone feature error rate over prior systems, with strong gains on non-native and aphasic speech. These results suggest that quantity-driven G2P scaling may yield diminishing returns for robust generalization.

2606.16015 2026-06-16 cs.CV New submission

Stringalign: Moving beyond summary statistics with a transparent Unicode-aware tool for evaluating automatic transcription models

Yngve Mardal Moe, Marie Roald

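Affiliations: Independent researcher; The National Library of Norway

AI summary: Presents the Stringalign library, which addresses ambiguous definitions of character and word error rates through transparent preprocessing and error analysis, supporting reproducible evaluation of HTR, OCR, and ASR models.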

Abstract

Comparing text strings is crucial when evaluating and understanding the performance of various text processing tasks such as document recognition and audio transcription. With an increasingly complex landscape of AI-based handwritten text recognition (HTR), optical character recognition (OCR) and automatic speech recognition (ASR) models, there is a need for tools that facilitate evaluation in a flexible and reproducible way. This paper presents Stringalign, a Python library designed to simplify the evaluation process for automatic transcription projects and facilitate transparent evaluation. Stringalign's tools for examining and visualising both the rate of errors and the types of errors a model makes give insights into possible improvements and help inform model selection for a particular task. Widely used string comparison metrics, such as the character and word error rates (CER and WER), although useful, can be ambiguous due to varying definitions of what constitutes a character and a word. Stringalign addresses this challenge by ensuring all preprocessing (i.e. normalisation and tokenisation) is transparent and easily replicable, and by providing tools to move beyond summary statistics and analyse common model errors. Moreover, Stringalign adheres to FAIR (Findable, Accessible, Interoperable, and Reusable) principles for research software while staying lightweight and easy to adapt into researchers' existing workflows. In this paper, we discuss challenges with character- and word-level string comparisons and show through examples that, where existing tools can yield opaque and sometimes confusing results, Stringalign provides an easy-to-use and unambiguous alternative.
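
The ambiguity this abstract mentions can be seen in a small standalone sketch, which deliberately does not use Stringalign's own API (not described in the abstract). It computes a character error rate over Unicode code points with the standard-library `unicodedata` module; the helper names `edit_distance` and `cer` are hypothetical.

```python
import unicodedata

def edit_distance(a, b):
    """Levenshtein distance between two sequences of code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(reference, prediction, form):
    """Character error rate after normalising both strings to `form`."""
    ref = unicodedata.normalize(form, reference)
    hyp = unicodedata.normalize(form, prediction)
    return edit_distance(ref, hyp) / len(ref)

reference, prediction = "caf\u00e9", "cafe"
cer_nfc = cer(reference, prediction, "NFC")  # accented e is one code point
cer_nfd = cer(reference, prediction, "NFD")  # accented e is "e" + combining accent
```

The same pair of strings yields 1/4 under NFC and 1/5 under NFD, so a reported CER is only comparable when the normalisation and the definition of a character are stated.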

2606.16011 2026-06-16 cs.CL New submission

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner

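Affiliations: Technical University of Munich; Munich Center for Machine Learning; LMU Munich

AI summary: Proposes a protocol for evaluating the answer stability of large language models by challenging correct answers with generated counterarguments; flip rates range from 17.5% to 97.3%, and self-attribution and cross-model arguments significantly affect stability.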

Comments Accepted to the non-archival workshops AI4Good and AIWILD at ICML 2026

Abstract

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
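
The flip-rate measurement described in this abstract can be sketched as follows. The record keys (`gold`, `initial`, `after_challenge`) are hypothetical and do not reflect the schema of the released challenge records; the sketch only assumes that the rate is taken over questions the model first answered correctly, as the protocol describes.

```python
def flip_rate(records):
    """Fraction of initially correct answers that change after a challenge.

    Each record is a dict with hypothetical keys:
    'gold', 'initial', 'after_challenge' (option labels such as 'A'..'D').
    """
    eligible = [r for r in records if r["initial"] == r["gold"]]
    if not eligible:
        return 0.0
    flipped = sum(r["after_challenge"] != r["initial"] for r in eligible)
    return flipped / len(eligible)

records = [
    {"gold": "A", "initial": "A", "after_challenge": "B"},  # flipped
    {"gold": "C", "initial": "C", "after_challenge": "C"},  # held
    {"gold": "B", "initial": "D", "after_challenge": "D"},  # skipped: wrong at start
    {"gold": "D", "initial": "D", "after_challenge": "A"},  # flipped
]
rate = flip_rate(records)
```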

2606.16003 2026-06-16 cs.AI New submission

SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity

Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, Jiahuan Pei

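Affiliations: Vrije Universiteit Amsterdam; Wageningen University & Research

AI summary: Studies the ability of large language models to generate mathematical equations from scientific text, builds a dataset of AI papers, proposes an explainable equation generation workflow, and designs an evaluation protocol combining automatic metrics, LLM-based evaluation, and human judgment; LLMs fall short on semantic accuracy.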

Comments Accepted by findings of ACL 2026

Abstract

This work investigates the ability of large language models (LLMs) to generate mathematical equations from scientific texts. Prior work faces challenges in unstructured grounding, multi-equation dependency, and human-aligned evaluation. To this end, we construct a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions. We develop an explainable equation generation workflow and evaluate it across diverse open- and closed-source LLM backbones. We introduce an evaluation protocol combining automatic metrics, LLM-based rubrics, and human judgments to assess accuracy, explainability, and human-LLM alignment. Results indicate that LLMs perform moderately on lexical- and syntactic-based similarity, while struggling with semantic accuracy. Comparisons between LLM-based evaluations and human judgments reveal limited alignment, highlighting challenges in using LLMs to assess equation quality. These findings offer insights for improving equation generation models and developing more reliable evaluation methods for scientific text. We provide code and data for reproducibility.

2606.16002 2026-06-16 cs.LG New submission

Decomposing one-class support vector machine into an ensemble of one-data support vector machines

Toshitaka Hayashi, Dalibor Cimr, Hamido Fujita, Richard Cimler

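Affiliations: University of Hradec Králové; Universiti Teknologi Malaysia

AI summary: To address the scalability of one-class support vector machines on large datasets, proposes decomposing the dataset into individual samples, training a one-data support vector machine on each, and combining the models through ensemble learning, with a data-reduction strategy for further speed-up; experiments show a substantial speed-up with classification performance maintained.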

Abstract

One-class classification (OCC) is a classification problem in which the training data contains only one class. The one-class support vector machine (OCSVM) is one of the most competitive OCC algorithms. However, OCSVM has scalability issues with large-scale datasets. This paper proposes an acceleration strategy for OCSVM. The idea is to decompose the dataset into samples and train OCSVM models for single data points. Subsequently, ensemble learning is applied to combine all models to compute the OCSVM model for the dataset. In addition, further acceleration is achieved through a data-reduction strategy with an OCSVM model trained on the average of the training samples. The experiment compared the proposal and traditional OCSVM using the Python package. The proposed strategy is faster than traditional OCSVM, while achieving similar classification results. Moreover, the proposed strategy can create a one-to-one correspondence between samples and models. Source code is available at https://github.com/ToshiHayashi/ODSVM

2606.15997 2026-06-16 cs.RO cs.SY eess.SY New submission

Friction Characterization of a Cable-Driven Differential Actuation System for Lower-Limb Exoskeletons

Alberto Maria Nobili, Fabio Salsedo, Alessandro Filippeschi

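Affiliations: Institute of Mechanical Intelligence, Department of Excellence in Robotics and AI, Sant'Anna School of Advanced Studies; Wearable Robotics s.r.l.

AI summary: Proposes a differential actuation architecture for hip-knee flexion/extension in lower-limb exoskeletons that achieves cooperative torque sharing through a linear differential mapping between motors and joints, and develops a model-based friction estimation strategy for sensorless torque estimation.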

Comments Accepted for presentation IEEE RAS/EMBS 11th International Conference on Biomedical Robotics and Biomechatronics

Abstract

Lower-limb exoskeletons require actuation systems that can provide accurate joint torque control while preserving low mass and encumbrance. Conventional architectures often rely on independently actuated joints and joint-level torque sensors, increasing system complexity and weight. This paper presents a novel differential actuation architecture for hip-knee flexion/extension, enabling cooperative torque sharing between two motors via a linear differential mapping between motor and joint. To compensate for transmission losses, a model-based friction estimation strategy is developed and experimentally implemented, allowing accurate joint torque estimation without the need for torque sensors. The proposed solution is validated on a physical prototype, demonstrating the feasibility of sensorless torque estimation in a differentially actuated hip-knee module of a lower-limb exoskeleton.

2606.15994 2026-06-16 cs.AI cs.LG New submission

Agentic Framework for Deep Learning workload migration via In-Context Learning

Qiyue Liang, Steven Ingram, George Vanica, Andi Gavrilescu, Newfel Harrat, Hassan Sipra, Sethuraman Sankaran

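Affiliations: Google

AI summary: Proposes an autonomous system combining in-context learning with oracle-driven self-debugging to migrate deep learning models from PyTorch to JAX automatically, reaching 91% numerical equivalence on neural modules.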

Abstract

Translating deep learning models from PyTorch's flexible, object-oriented design to JAX's functional, stateless setup is usually a manual and error-prone task. Automated migration is challenging because Large Language Models (LLMs) struggle with strict and dynamic API alignment and are prone to mistakes in exacting operations. We propose a fully autonomous system that combines In-Context Learning (ICL) with oracle-driven self-debugging. First, we curated an ICL context that serves as a strict reference for idiomatic JAX styling and test case generation. Second, instead of depending on the LLM to deduce mathematical outputs, we run the source PyTorch modules to get their actual dynamic tensor states. This creates an unchangeable execution oracle. We then use an autonomous agentic loop to synthesize tests based on the oracle data. The test cases are executed repeatedly, and the traceback is sent back to the LLM for self-correction. Ablations show that combining ICL references with oracle grounding and self-debugging greatly outperforms pure instructional and basic agentic baselines. This improvement does not add excessive computational overhead. Our lightweight pipeline achieves 91% numerical equivalence (compared to baseline: 9%, instruction + self-debugging: 27%) on neural modules, providing a highly reliable, scalable blueprint for cross-framework migration. This has been validated across several state-of-the-art models, including SAM (Segment Anything), T5, Code Whisper, among others, showing high numerical equivalence. Code: https://github.com/AI-Hypercomputer/accelerator-agents/tree/main/MaxCode
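
The oracle idea in this abstract (run the source implementation once, freeze its outputs, and test every migrated candidate against them) can be sketched in plain Python. No PyTorch or JAX is used here; the functions `record_oracle` and `numerically_equivalent`, the softmax stand-ins, and the tolerances are all hypothetical and are not taken from the MaxCode repository.

```python
import math

def record_oracle(source_fn, inputs):
    """Run the source implementation once and freeze its outputs."""
    return [(x, tuple(source_fn(x))) for x in inputs]

def numerically_equivalent(candidate_fn, oracle, rel_tol=1e-5, abs_tol=1e-6):
    """Check a migrated function against the frozen oracle outputs."""
    for x, expected in oracle:
        actual = tuple(candidate_fn(x))
        if len(actual) != len(expected):
            return False
        if not all(math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
                   for a, e in zip(actual, expected)):
            return False
    return True

# Stand-ins for a source module and two migrated candidates.
def source_softmax(xs):
    exps = [math.exp(v) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def good_candidate(xs):
    shift = max(xs)  # numerically stable variant, same result
    exps = [math.exp(v - shift) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def bad_candidate(xs):
    total = sum(xs)  # forgets the exponential
    return [v / total for v in xs]

oracle = record_oracle(source_softmax, [(1.0, 2.0, 3.0), (0.5, 0.5, 2.0)])
```

In a self-debugging loop, a failed check like the one `bad_candidate` triggers would be reported back to the model together with the traceback or mismatch.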

2606.15992 2026-06-16 cs.CV New submission

Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

Jigyashman Hazarika

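Affiliations: Kaggle

AI summary: Proposes a multi-task pipeline that automatically recognizes stroke type, predicts shot direction, and grades posture quality from RGB video, with rule-based feedback for coaching tips; in the cross-player test, stroke-type accuracy drops by only 0.8%.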

Comments 14 pages, 9 figures

Abstract

We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges, 82.9%, a 0.8% drop. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.
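
The weighted score s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder from this abstract can be sketched as a per-frame computation followed by simple peak picking. The weights come from the abstract; the peak rule, the threshold, the minimum gap, and the toy signals are assumptions, since the abstract does not say how strokes are extracted from the score or how the elbow and shoulder terms are measured.

```python
def stroke_score(v_wrist, m_elbow, m_shoulder):
    """Weighted joint score s(t) from the abstract, one value per frame."""
    return [0.5 * w + 0.3 * e + 0.2 * s
            for w, e, s in zip(v_wrist, m_elbow, m_shoulder)]

def detect_strokes(scores, threshold, min_gap):
    """Indices of local maxima above `threshold`, at least `min_gap` frames apart.

    The threshold and the minimum gap are illustrative assumptions.
    """
    peaks = []
    for t in range(1, len(scores) - 1):
        is_peak = scores[t] > scores[t - 1] and scores[t] >= scores[t + 1]
        if is_peak and scores[t] >= threshold:
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks

# Toy per-frame signals with two bursts of motion.
v_wrist    = [0.1, 0.2, 2.0, 0.3, 0.1, 0.2, 1.8, 0.2]
m_elbow    = [0.1, 0.1, 1.0, 0.2, 0.1, 0.1, 1.2, 0.1]
m_shoulder = [0.0, 0.1, 0.5, 0.1, 0.0, 0.1, 0.5, 0.1]
scores = stroke_score(v_wrist, m_elbow, m_shoulder)
peaks = detect_strokes(scores, threshold=1.0, min_gap=3)
```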

2606.15987 2026-06-16 cs.CV cs.DL New submission

A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts

Fabio Quattrini, Carmine Zaccagnino, Costanza Bianchi, Silvia Cascianelli, Rita Cucchiara

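Affiliations: University of Modena and Reggio Emilia

AI summary: For low-resource handwritten text recognition, builds SCAM, a dataset of Sahidic Coptic ancient manuscripts, and evaluates several state-of-the-art methods, exposing the limitations of current approaches on low-resource historical text.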

Comments Accepted at ICDAR 2026

Abstract

In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.

2606.15984 2026-06-16 cs.CL New submission

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

Andrei-Marius Avram, Aureliu-Valentin Antonie, Ştefan-Bogdan Badea, Andrei Florea, Robert-Nicolae Zaharoiu, Dumitru-Clementin Cercel

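Affiliations: National University of Science and Technology POLITEHNICA Bucharest

AI summary: Presents the ROMPAR corpus and a multi-task adversarial training framework that uses an exponential decay mechanism and LLM-guided decoding to achieve demographic invariance and morphological completion; WER drops significantly and morphological reconstruction reaches an F1 of 96.6%.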

Abstract

Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. We address the inherent instability of adversarial objectives in generative architectures by introducing an exponential decay mechanism for the adversarial coefficients. Furthermore, we implement an LLM-guided decoding strategy with position-dependent weighting to facilitate morphological completion of truncated terminal words. Our results demonstrate that the proposed framework significantly reduces WER and achieves an F1-score of 96.6% in morphological reconstruction.
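
The exponential decay of the adversarial coefficients mentioned in this abstract might look like the schedule below. The decay rate, the initial weight, the loss values, and the way the adversary losses enter the total objective are all assumptions; the abstract names the mechanism but gives no formula.

```python
import math

def adversarial_weight(step, initial_weight=1.0, decay_rate=1e-3):
    """Exponentially decaying coefficient for the adversarial terms."""
    return initial_weight * math.exp(-decay_rate * step)

def combined_loss(asr_loss, adversary_losses, step):
    """ASR loss minus the weighted adversary losses.

    The minus sign reflects one common adversarial setup, in which the encoder
    is trained to make the demographic classifiers fail; this is an assumption.
    """
    weight = adversarial_weight(step)
    return asr_loss - weight * sum(adversary_losses.values())

adversary_losses = {"age": 0.5, "gender": 0.3, "dialect": 0.2}
early = combined_loss(2.0, adversary_losses, step=0)
late = combined_loss(2.0, adversary_losses, step=5000)
```

Early in training the adversarial terms carry full weight; as the coefficient decays, the objective approaches the plain ASR loss, which is one way to limit the instability the abstract attributes to adversarial objectives.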

2606.15980 2026-06-16 cs.LG cs.AI cs.CL New submission

Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

Evan Duan

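Affiliations: University of Michigan

AI summary: Studies whether activation monitors remain reliable after a language model is updated; quantization updates have little effect, fine-tuning updates often make monitors fail, and degradation can be predicted from pre-deployment features.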

Abstract

Activation monitors, lightweight probes trained on a language model's internal representations, are an increasingly common layer in deployment safety stacks. Deployed models, however, are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.

2606.15978 2026-06-16 cs.LG New submission

Scalar-Stepsize Nonuniform Monte Carlo Optimistic Policy Iteration: A Certified Counterexample

Yuanlong Chen

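Affiliations: Yuanlong Chen

AI summary: For Monte Carlo optimistic policy iteration under nonuniform update frequencies, uses a three-state MDP counterexample to show that scalar-stepsize nonuniform asynchronous value iteration may fail to converge, and reveals switched attracting cycles caused by anisotropic distortion.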

Abstract

Tsitsiklis proved convergence of Monte Carlo optimistic policy iteration under a uniform update structure and identified nonuniform update frequencies as a delicate obstruction. We give a certified negative answer for the natural scalar-stepsize, unnormalized asynchronous state-value recursion with fixed nonuniform state-selection probabilities. In a three-state, two-action discounted MDP, the nonuniform update frequencies induce a diagonally scaled greedy-policy mean field with a certified nonconstant attracting hybrid periodic orbit. With a bounded unbiased geometric-horizon estimator and Robbins--Monro stepsizes, the original stochastic recursion remains trapped near the cycle with positive probability and therefore fails to converge. The example pinpoints a geometric obstruction: uniform sampling gives radial residual contraction, whereas scalar nonuniform sampling anisotropically distorts the residual dynamics and can generate switched attracting cycles.

2606.15976 2026-06-16 cs.CV New submission

HadBalance: A Plug-and-Play Unified Global Geometric Prior Framework for Generalizable Biomedical Segmentation

Zhuangzhi Gao, Feixiang Zhou, He Zhao, Wenhan Chen, Ruiyu Luo, Xin Wang, Hongyi Qin, Zhongli Wu, Yanda Meng, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

Affiliations: Department of Eye and Vision Science, University of Liverpool; Department of Primary Care and Mental Health, University of Liverpool; School of Psychological and Cognitive Sciences, Peking University; Computer Vision Research Group, University of Amsterdam; Institute of Life Course & Medical Sciences, University of Liverpool; Bioengineering Program, Biomedical Sciences Division (BioMed), King Abdullah University of Science and Technology (KAUST); Ningbo Cixi Institute of Biomedical Engineering, Chinese Academy of Sciences; Liverpool Centre for Cardiovascular Science, University of Liverpool

AI summary: To address the lack of unified, generalizable geometric priors in biomedical image segmentation, proposes a global near-convex shape prior based on Hadwiger's theorem, combined with conflict-aware objective balancing, for plug-and-play segmentation across organs and modalities.

Comments Provisionally accepted by the 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026). 11 pages, 3 figures, 2 tables

Abstract

Precise biomedical image segmentation is crucial for clinical diagnosis. Geometric cues (e.g., boundary, shape, and topology) can improve structural consistency, yet most are task-specific and lack a unified geometric foundation that generalizes across organs and modalities. We are motivated by the observation that several medical segmentation targets can be approximated as globally near-convex shapes. A convex region is one in which any two interior points can be connected by a line segment entirely contained within the region. In practice, medical targets may exhibit small local concavities or boundary irregularities; we refer to such globally convex-like shapes as near-convex. Motivated by this, we derive Hadwiger Shape Priors from Hadwiger's theorem as an interpretable global regularizer using three 2D measures: area A, perimeter P, and Euler characteristic chi, enabling transfer across organs and modalities. However, because medical datasets are shape-heterogeneous, enforcing near-convex priors uniformly can over-regularize non-convex anatomy with significant concavities, washing out concavities and fine details and degrading segmentation accuracy. To address this challenge, we propose Conflict-Aware Objective Balancing (CAOB), which integrates shape priors with segmentation in a gradient-aware manner. For each prior, CAOB removes only the gradient component that conflicts with segmentation while preserving the remaining aligned component, and adaptively regulates objective influences to prevent prior dominance. This enables stable use of shape priors on shape-heterogeneous data without erasing genuine concavities or fine structural details. We call this plug-and-play framework HadBalance.
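
The step in this abstract where CAOB "removes only the gradient component that conflicts with segmentation" resembles a PCGrad-style projection, sketched below on plain Python lists. Treating it as exactly this projection is an assumption, the function names are hypothetical, and the adaptive regulation of objective influence that the abstract also mentions is not modelled here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_conflict(prior_grad, seg_grad):
    """Drop the part of prior_grad that opposes seg_grad (PCGrad-style projection)."""
    overlap = dot(prior_grad, seg_grad)
    seg_norm_sq = dot(seg_grad, seg_grad)
    if overlap >= 0 or seg_norm_sq == 0:
        return list(prior_grad)  # no conflict: keep the prior gradient as is
    scale = overlap / seg_norm_sq
    return [p - scale * s for p, s in zip(prior_grad, seg_grad)]

seg_grad = [1.0, 0.0]
conflicting_prior = [-1.0, 1.0]  # points against the segmentation gradient
aligned_prior = [0.5, 1.0]       # already agrees with it
projected = remove_conflict(conflicting_prior, seg_grad)
untouched = remove_conflict(aligned_prior, seg_grad)
```

After projection the prior gradient is orthogonal to the segmentation gradient, so the prior can still shape the prediction without pushing directly against the segmentation objective.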

2606.15974 2026-06-16 cs.CL New submission

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

Weixiao Zhou, Gengyao Li, Xianfu Cheng, Junnan Zhu, Feifei Zhai, Zhoujun Li

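Affiliations: CCSE, Beihang University; MAIS, CASIA; Fanyu AI Laboratory

AI summary: Presents the OmniCSEval benchmark of 1,800 conversations across six scenarios and evaluates 28 LLMs on completeness, conciseness, and faithfulness in conversation summarization, revealing the effects of reasoning capability and model scale.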

Comments 21 pages, 18 figures

Abstract

Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning systems and efficient small models, or lack fine-grained, multi-dimensional assessments. To bridge these gaps, we propose OmniCSEval, a unified benchmark comprising 1,800 diverse conversations across six real-world scenarios, featuring context lengths ranging from 128 to 32k tokens. For fine-grained evaluation, we employ a bidirectional fact-checking framework that integrates key fact matching to assess completeness and conciseness, alongside summary fact verification to evaluate faithfulness. To ensure reliable assessment, we establish a human-LLM collaborative pipeline for key fact extraction and a multi-LLM consensus verifier for summary fact decomposition. Leveraging this framework, we evaluate 28 LLMs across four distinct categories grouped by reasoning capability and model scale. Our extensive empirical study reveals critical insights regarding the cross-scenario challenges current LLMs continue to face, the impacts of reasoning and scale, and the efficiency and adaptability of reasoning models. We also provide guidance for system selection in real-world deployments.

2606.15972 2026-06-16 cs.CL cs.AI cs.LG New submission

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

一次形式化,其余编辑:基于Lean的高效数学推理答案选择

Ji Feng, Zhouxing Shi

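Affiliations * University of California, Riverside

AI summary Proposes the BASE pipeline, which formalizes one candidate answer and edits the rest, cutting autoformalizer calls by about 5x while improving selection accuracy.

Comments 15 pages, 1 figure. Code available at https://github.com/ucr-rai/base-and-edit

Details
AI Chinese abstract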

随着大型语言模型(LLMs)越来越多地应用于数学推理,形式化证明助手(如Lean)可用于以机器可检查的严谨性验证推理输出,从而支持在测试时扩展中从K个采样候选答案中进行答案选择等用例。然而,使用Lean要求LLM的输出(最初为自然语言)首先被形式化。现有的基于Lean的答案选择工作使用自动形式化模型为每个候选答案独立生成一个Lean形式化语句,这带来了显著的计算成本。我们提出BASE,一个基础-编辑流水线,它为每个问题形式化一个基础候选答案,并通过就地编辑答案表达式来推导出其余K-1个语句。为此,我们训练了一个重写器模型LEANSCRIBE,用于定位基础形式化中的答案,并为其他K-1个候选答案生成可重用的编辑函数。BASE同时提高了选择准确性并降低了形式化成本——这是一个帕累托改进,在四个基准测试和三个求解器上的所有12个(数据集,求解器)配置中均成立,在K=8时自动形式化器调用减少约5倍,且随着K增长,减少幅度预计会更大。代码可在https://github.com/ucr-rai/base-and-edit获取。

English abstract

With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost - a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at https://github.com/ucr-rai/base-and-edit.
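The base-and-edit idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the Lean-style statement is hypothetical, and a naive last-occurrence string search stands in for the trained LEANSCRIBE rewriter, which the abstract describes only at a high level.

```python
def make_edit_function(base_statement: str, base_answer: str):
    """Locate the answer expression in the base statement and return an edit function."""
    # Naive localization (assumption): take the last occurrence of the answer text.
    start = base_statement.rfind(base_answer)
    if start < 0:
        raise ValueError("base answer not found in base statement")
    prefix = base_statement[:start]
    suffix = base_statement[start + len(base_answer):]

    def edit(candidate: str) -> str:
        # Swap the candidate into the slot that held the base answer.
        return f"{prefix}{candidate}{suffix}"

    return edit


# Hypothetical Lean-style statement standing in for one formalization call.
BASE = "theorem check (x : ℝ) (h : 2 * x + 3 = 11) : x = 4 := by sorry"

edit = make_edit_function(BASE, "4")
# The remaining candidates reuse the same edit function instead of a new formalization.
statements = [edit(c) for c in ["4", "7", "5/2"]]
```

In this toy setup only the base statement would need a formalization call; the other statements are produced by string edits and could then be passed to a Lean checker.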

2606.15971 2026-06-16 cs.CL New submission

SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

SAG: 查询时动态超边的SQL检索增强生成

Yuchao Wu, Junqin Li, XingCheng Liang, Yongjie Chen, Yinghao Liang, Linyuan Mo, Guanxian Li

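Affiliations * Zleap AI

AI summary Proposes the SAG architecture, which uses SQL queries to dynamically build local hyperedge indexes and avoids global graph maintenance; it achieves the best results on 8 of 9 Recall@K metrics across three multi-hop benchmarks and supports production deployment over hundreds of millions of data items.

Details
AI Chinese abstract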

检索增强生成(RAG)为大语言模型访问外部知识提供了一种有效方法。然而,现有方法依赖密集相似性检索,在处理结构化约束和多跳推理方面存在固有限制。引入知识图谱可部分缓解这些问题,但代价是语义碎片化、高维护成本和难以增量更新。本文介绍了SAG(SQL检索增强生成),一种用于检索和代理系统的结构化架构。SAG不预先构建全局静态图,而是将每个块转换为一个语义完整的事件和一组索引实体,然后使用SQL连接查询动态地将共享实体的事件链接到局部超边中,在查询时构建动态实例化的局部索引结构。这种设计避免了全局图重建和持续维护的需要;该系统通过依赖标准数据库基础设施,自然支持增量写入、并发处理和持续扩展。在HotpotQA、2WikiMultiHop和MuSiQue这三个标准多跳基准上,SAG在9个Recall@K指标中的8个上取得了最佳结果,在MuSiQue(多跳推理需求最高的基准)上达到80.0%的Recall@5。SAG还已在数亿数据项的生产规模上部署,在线检索延迟保持在秒级。项目网站和代码见https://github.com/Zleap-AI/SAG-Benchmark。

English abstract

Retrieval-Augmented Generation (RAG) offers an effective approach for large language models to access external knowledge. However, existing methods rely on dense similarity retrieval and face inherent limitations in handling structured constraints and multi-hop reasoning. Incorporating knowledge graphs partially alleviates these issues, but at the cost of semantic fragmentation, high maintenance overhead, and difficult incremental updates. This paper introduces SAG (SQL-Retrieval Augmented Generation), a structured architecture for retrieval and agent systems. Instead of pre-building a global static graph, SAG converts each chunk into one semantically complete event and a set of indexing entities, then uses SQL join queries to dynamically link events that share entities into local hyperedges, constructing, at query time, a dynamically instantiated local index structure. This design avoids the need for global graph rebuilding and ongoing maintenance; the system naturally supports incremental writes, concurrent processing, and continuous scaling through its reliance on standard database infrastructure. Across HotpotQA, 2WikiMultiHop, and MuSiQue, three standard multi-hop benchmarks, SAG achieves the best results on 8 out of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the benchmark with the highest multi-hop reasoning demands. SAG has also been deployed at a production scale of hundreds of millions of data items, with online retrieval latency kept within seconds. Project site and code are available at https://github.com/Zleap-AI/SAG-Benchmark.
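The query-time linking step can be sketched with Python's built-in `sqlite3`. The table names (`events`, `event_entities`), the columns, and the sample rows are hypothetical and are not taken from the SAG codebase; the sketch only shows how a SQL self-join can group events that share an entity into local hyperedges when a query arrives.

```python
import sqlite3
from collections import defaultdict


def build_db():
    """Create a tiny in-memory store: one row per event, one row per (event, entity)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE event_entities (event_id INTEGER, entity TEXT);
    """)
    conn.executemany("INSERT INTO events VALUES (?, ?)", [
        (1, "Marie Curie won the Nobel Prize in Physics in 1903."),
        (2, "Marie Curie was born in Warsaw."),
        (3, "Warsaw is the capital of Poland."),
    ])
    conn.executemany("INSERT INTO event_entities VALUES (?, ?)", [
        (1, "Marie Curie"), (1, "Nobel Prize"),
        (2, "Marie Curie"), (2, "Warsaw"),
        (3, "Warsaw"), (3, "Poland"),
    ])
    return conn


def local_hyperedges(conn, seed_entities):
    """Return {entity: [event ids]} for entities shared by two or more events."""
    marks = ",".join("?" for _ in seed_entities)
    rows = conn.execute(
        f"""
        SELECT DISTINCT link.entity, link.event_id
        FROM event_entities AS seed
        JOIN event_entities AS hop  ON hop.event_id = seed.event_id  -- entities in seed events
        JOIN event_entities AS link ON link.entity = hop.entity      -- events sharing them
        WHERE seed.entity IN ({marks})
        """,
        list(seed_entities),
    ).fetchall()
    edges = defaultdict(set)
    for entity, event_id in rows:
        edges[entity].add(event_id)
    # Keep only entities that actually connect several events.
    return {e: sorted(ids) for e, ids in edges.items() if len(ids) > 1}


conn = build_db()
edges = local_hyperedges(conn, ["Marie Curie"])
```

Starting from the seed entity "Marie Curie", the join reaches event 3 through the shared entity "Warsaw", which is the kind of multi-hop link a global graph would otherwise have to store ahead of time. New events only require plain `INSERT` statements.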

2606.15967 2026-06-16 cs.CV New submission

CRIS: Cross-Plane Self-Supervised Isotropic Restoration for Anisotropic Volumetric Imaging Across Modalities

CRIS:跨模态各向异性体积成像的跨平面自监督各向同性恢复

Adi Ahituv, Anat Ilivitzki, Moti Freiman

Affiliations * Faculty of Data and Decision Sciences, Technion -- Israel Institute of Technology; Faculty of Biomedical Engineering, Technion -- Israel Institute of Technology; The May-Blum-Dahl MRI Research Center, Technion -- Israel Institute of Technology

AI summary Proposes CRIS, a cross-plane self-supervised framework that needs no paired isotropic ground truth and performs 3D isotropic restoration as 2D stripe completion on orthogonal reformats, outperforming interpolation and several other methods on MRI and volume electron microscopy.

Comments 22 pages, 8 figures, supplementary material included. Submitted to Medical Image Analysis

Details
AI Chinese abstract

各向异性体积采集在临床MRI和体积电子显微镜(vEM)中很常见,其中稀疏的跨平面采样产生厚切片或截面,降低了正交重切和下游分析的质量。我们提出CRIS,一种跨平面自监督框架,无需配对各向同性真值即可实现各向同性恢复。CRIS将3D恢复视为各向同性网格正交重切上的2D条带补全:训练时,高分辨率面内切片被合成退化并周期性掩蔽;推理时,空白切片定义各向同性网格,恢复两个正交重切,并通过多视图平均融合预测。我们在两个MRI队列和两个显微镜基准上评估CRIS,各向异性高达8倍。在脑MRI上,CRIS达到32.921±0.436 dB PSNR和0.9631±0.0027 SSIM,优于插值、SMORE4、SIMPLE、SA-INR和ATME,并给出最佳分割一致性(Dice 0.940±0.004,ASSD 0.245±0.014 mm,HD99 1.275±0.061 mm)。在无参考腹部MRI上,CRIS将FID/KID降至48.714/0.023。在vEM上,CRIS优于插值、NIIV和vEMINR,在4倍时达到29.133 dB/0.834 3D PSNR/SSIM,在EPFL 8倍时达到27.123 dB/0.734,在噪声hemibrain数据上达到21.915 dB/0.699。在鲁棒性实验中,一个可变间隙CRIS模型在间隙因子3-7以及冠状、轴向和矢状退化下评估,保持比插值更高的PSNR/SSIM(36.36-31.14 dB和0.977-0.932对比33.07-27.85 dB和0.951-0.853)。这些结果支持CRIS作为一种模态灵活的途径,无需配对各向同性目标或特定配置的重新训练即可实现各向同性恢复。代码可在https://github.com/adi-hatav/CRIS获取。

English abstract

Anisotropic volumetric acquisitions are common in clinical MRI and volume electron microscopy (vEM), where sparse through-plane sampling creates thick slices or sections that degrade orthogonal reformats and downstream analysis. We present CRIS, a cross-plane self-supervised framework for isotropic restoration without paired isotropic ground truth. CRIS casts 3D restoration as 2D stripe completion on orthogonal reformats of an isotropic grid: high-resolution in-plane slices are synthetically degraded and periodically masked for training, while at inference blank slices define the isotropic grid, two orthogonal reformats are restored, and predictions are fused by multi-view averaging. We evaluate CRIS on two MRI cohorts and two microscopy benchmarks up to 8x anisotropy. On brain MRI, CRIS achieves 32.921 +/- 0.436 dB PSNR and 0.9631 +/- 0.0027 SSIM, outperforming interpolation, SMORE4, SIMPLE, SA-INR, and ATME, and gives the best segmentation consistency (Dice 0.940 +/- 0.004, ASSD 0.245 +/- 0.014 mm, HD99 1.275 +/- 0.061 mm). On reference-free abdominal MRI, CRIS reduces FID/KID to 48.714/0.023. On vEM, CRIS outperforms interpolation, NIIV, and vEMINR, reaching 29.133 dB/0.834 3D PSNR/SSIM at 4x, 27.123 dB/0.734 on EPFL at 8x, and 21.915 dB/0.699 on noisy hemibrain data. In a robustness experiment, one variable-gap CRIS model evaluated across gap factors 3--7 and coronal, axial, and sagittal degradations maintained higher PSNR/SSIM than interpolation (36.36--31.14 dB and 0.977--0.932 vs. 33.07--27.85 dB and 0.951--0.853). These results support CRIS as a modality-flexible route to isotropic restoration without paired isotropic targets or configuration-specific retraining. Code is available at https://github.com/adi-hatav/CRIS.

2606.15956 2026-06-16 cs.CV cs.AI cs.LG New submission

You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

你不需要强假设:通过时间差异进行视觉表示学习

Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji

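Affiliations * UIUC; New York University

AI summary Proposes TDV, a self-supervised method that learns from video under a causal assumption (the past causes the future), avoiding strong inductive biases and matching state-of-the-art recipes on dense spatial tasks.

Details
AI Chinese abstract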

AI的进步很大程度上是由假设更少的方法驱动的。随着计算和数据量的增加,弱归纳偏置的方法通常优于强假设的方法。这在视觉表示学习领域尤为典型,方法从监督学习主导,到弱监督学习,再到如今无需人工标签的自监督学习的广泛成功。然而,即使是现代自监督学习方法仍然依赖于强归纳偏置,如数据增强、掩码或裁剪。如果这一趋势持续,这些剩余的偏置在大规模下将成为瓶颈——我们的实验证实了这一点:随着数据增长,归纳偏置的最优强度降低。这促使我们寻找依赖更少假设的方法。为此,我们提出了视觉时间差异(TDV),一种从视频中进行自监督学习的新范式,它避免了现有的归纳偏置,而是依赖于一个因果假设:过去导致未来。TDV通过联合训练图像编码器和运动编码器,使得当前帧的表示加上编码的运动等于下一帧的表示。尽管没有利用任何强归纳偏置,TDV在密集空间任务上达到了最先进的水平,为无需强假设的表示学习奠定了基础。

English abstract

Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale -- and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.

2606.15949 2026-06-16 cs.CL New submission

FinBalance: A Multi-Document Accounting Reconciliation Benchmark

FinBalance:多文档会计对账基准

Sasank Tumpati, Devansh Agarwal, Ayush Kedia, Arjun Neekhra, Murari Mandal, Krishna Garg, Yash Sinha, Suman Gupta, Dhruv Kumar

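Affiliations * BITS Pilani; KIIT Bhubaneswar; University of Oxford

AI summary Proposes the FinBalance benchmark, which builds accounting reconciliation tasks from multi-industry source documents and evaluates LLMs on producing balance sheets and detecting inconsistencies, finding substantial gaps in document binding and consistent aggregation.

Comments 18 pages, 12 figures. Code and data: https://github.com/Devansh1105/finbalance

Details
AI Chinese abstract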

现有的金融NLP基准主要评估已准备好的工件,如申报文件、表格或提取的值。真正的会计工作更早开始:源文档必须被对账到引用的日记账分录中,汇总到资产负债表,并检查矛盾。我们引入了FinBalance,一个多文档会计对账基准,由来自八个行业、三种期间类型和五个难度级别的源文档包构建。人工编写的业务场景、会计政策、税务/外汇处理、文档模式、干扰项和不一致性模板由确定性生成器组合,其分类账产生日记账分录、资产负债表和23个不一致性代码标签。在710条记录的评估分割上,六个当代LLM的精确最终资产负债表准确率最高仅为46%。四个模型在BS_exact(模型报告的资产负债表)和BS_recon(通过重放其分录到我们的分类账获得的资产负债表)之间显示出26-41个百分点的差距。模型通常恢复数值上合理的分录,但未能将其绑定到支持文档并一致地聚合。引用压力提示几乎不改变文档链接错误,而分类账反馈消融显著改善了报告的资产负债表并暴露了不一致性检测的权衡。专家财务评审人员验证了基准设计和标签。

English abstract

Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a balance sheet, and checked for contradictions. We introduce FinBalance, a multi-document accounting reconciliation benchmark built from source-document bundles across eight industries, three period types, and five difficulty levels. Human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates are composed by a deterministic generator whose ledger produces journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, six contemporary LLMs reach at most 46% exact final-balance-sheet accuracy. Four models show a 26-41 pp gap between BS_exact, the model's reported balance sheet, and BS_recon, the balance sheet obtained by replaying its entries through our ledger. Models often recover numerically plausible entries but fail to bind them to supporting documents and aggregate them consistently. Citation-pressure prompting barely changes document-linking errors, while ledger-feedback ablations substantially improve reported balance sheets and expose inconsistency-detection trade-offs. Expert finance reviewers validate the benchmark design and labels.
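The distinction between a reported balance sheet and one obtained by replaying journal entries can be sketched as follows. This is not the benchmark's ledger: the chart of accounts, the entry format, and the function names are hypothetical, chosen only to show how replaying double-entry journal lines yields a balance sheet that can then be compared with a model's reported one.

```python
from collections import defaultdict

# Hypothetical chart of accounts: account name -> balance-sheet section.
SECTIONS = {
    "Cash": "assets",
    "Inventory": "assets",
    "Accounts Payable": "liabilities",
    "Share Capital": "equity",
}


def replay(entries):
    """Replay journal entries; each entry is a list of (account, debit, credit) lines."""
    totals = defaultdict(int)
    for lines in entries:
        # Double-entry invariant: total debits equal total credits in every entry.
        if sum(d for _, d, _ in lines) != sum(c for _, _, c in lines):
            raise ValueError("unbalanced journal entry")
        for account, debit, credit in lines:
            section = SECTIONS[account]
            # Assets grow with debits; liabilities and equity grow with credits.
            sign = 1 if section == "assets" else -1
            totals[section] += sign * (debit - credit)
    return dict(totals)


def exact_match(reported, replayed):
    """Compare a reported balance sheet with the replayed one, section by section."""
    return all(reported.get(k) == replayed.get(k)
               for k in ("assets", "liabilities", "equity"))


ENTRIES = [
    [("Cash", 1000, 0), ("Share Capital", 0, 1000)],      # owners invest cash
    [("Inventory", 400, 0), ("Accounts Payable", 0, 400)],  # buy stock on credit
    [("Inventory", 100, 0), ("Cash", 0, 100)],            # buy stock with cash
]
replayed = replay(ENTRIES)
reported = {"assets": 1500, "liabilities": 400, "equity": 1000}  # hypothetical model output
```

Here the replayed sheet balances (assets equal liabilities plus equity) while the hypothetical reported sheet does not match it, which mirrors the kind of gap between plausible entries and inconsistent aggregation that the abstract describes.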

2606.15940 2026-06-16 cs.LG New submission

Causal-Privacy Audit Workflow for Synthetic and Distilled Data in Dropout Support

辍学支持中合成与蒸馏数据的因果隐私审计工作流

Hanghang Zheng, Xiwei Zhuang, Zhong Wang, Hong Liu, Xiao Chen, Jingwen He, Xia Li

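Affiliations * Central University of Finance and Economics; China Development Bank; University of Cambridge

AI summary Proposes the CaP-Eval workflow, which audits the predictive utility, causal fidelity, and privacy risk of synthetic student data under a fixed estimand, finding that DPGNet and distilled data preserve treatment-effect structure better than the baseline methods.

Details
AI Chinese abstract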

合成和蒸馏的学生数据越来越多地用于实现隐私意识的学习分析,但它们对面向决策的机构支持的适用性仍不确定。在辍学支持中,生成的数据不仅必须保留预测效用或分布相似性,还必须保留用于指导咨询、付款计划援助和奖学金相关决策的财务状况证据。方法:本研究引入了CaP-Eval,一种面向决策的因果隐私审计工作流,用于在固定估计目标、时间感知调整设计、估计器集和经验隐私治理筛选下评估生成的学生数据。该工作流比较了原始数据、蒸馏数据、对抗合成数据、统计合成数据和DPGNet隐私导向生成数据在预测效用、处理效应保真度、对替代估计器的鲁棒性以及局部训练记录邻近性方面的表现。结果:DPGNet和蒸馏数据比对抗和高斯Copula基线更可靠地保留了原始财务状况处理效应结构。DPGNet在epsilon水平上保留了完整的方向和秩一致性;epsilon=10产生了最小的非原始IPW和DML偏差,而epsilon=1和epsilon=5放大了若干财务状况对比。蒸馏数据保持高度忠实,但保留了最强的局部训练记录邻近信号。TabularGNet保留了定性方向但存在中度衰减,高斯Copula压缩了效应幅度。结论:预测效用、隐私导向、经验披露信号和因果保真度存在分歧;生成的学生数据在决策使用前需要对方向、幅度、重叠和发布治理风险进行联合审计。

English abstract

Synthetic and distilled student data are increasingly used to enable privacy-conscious learning analytics, yet their suitability for decision-facing institutional support remains uncertain. In dropout support, generated data must preserve not only predictive utility or distributional resemblance, but also the financial-status evidence used to guide advising, payment-plan assistance, and scholarship-related decisions. Method: This study introduces CaP-Eval, a decision-facing causal-privacy audit workflow for evaluating generated student data under a fixed estimand, timing-aware adjustment design, estimator set, and empirical privacy-governance screen. The workflow compares original, distilled, adversarial synthetic, statistical synthetic, and DPGNet privacy-oriented generated data on predictive utility, treatment-effect fidelity, robustness to alternative estimators, and local training-record proximity. Results: DPGNet and distilled data preserved the original financial-status treatment-effect structure more reliably than the adversarial and Gaussian Copula baselines. DPGNet preserved full direction and rank agreement across epsilon levels; epsilon = 10 produced the smallest non-original IPW and DML deviations, while epsilon = 1 and epsilon = 5 amplified several financial-status contrasts. Distilled data remained highly faithful but retained the strongest local training-record proximity signal. TabularGNet preserved qualitative directions with moderate attenuation, and Gaussian Copula compressed effect magnitudes. Conclusions: Predictive utility, privacy orientation, empirical disclosure signals, and causal fidelity diverged; generated student data require joint audits of direction, magnitude, overlap, and release-governance risk before decision use.

2606.15938 2026-06-16 cs.CV cs.MM New submission

Learning Directional Semantic Transitions for Longitudinal Chest X-ray Analysis

学习纵向胸部X光分析的方向性语义转变

Zhangfeng Hu, Zefan Yang, Ge Wang, Tanveer Syeda-Mahmood, Anushree Burade, Mannudeep Kalra, Pingkun Yan

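Affiliations * Rensselaer Polytechnic Institute; Stanford University; Massachusetts General Hospital, Harvard Medical School

AI summary Proposes the ProTrans framework, which models disease progression as a directional semantic transition between paired CXR studies, uses radiology reports and a learnable progression feature map to explicitly encode semantic shifts, achieves direction awareness through reversed temporal modeling and bidirectional reconstruction consistency, and outperforms existing methods on longitudinal downstream tasks.

Comments MICCAI 2026

Details
AI Chinese abstract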

胸部X光(CXR)解读通常需要纵向比较以评估疾病进展。现有方法通常依赖于时间特征融合或研究间差异建模,但在捕捉细微进展语义方面仍有限,且忽视了疾病轨迹固有的方向性。本文提出ProTrans,一种新颖的视觉-语言预训练框架,将疾病进展建模为配对CXR研究间的方向性语义转变。ProTrans利用放射学报告将单个CXR表示锚定到可解释的疾病状态,并引入可学习的进展特征图以显式编码状态间的语义转变,与报告导出的进展描述对齐。为强制方向感知,ProTrans结合了反向时间建模过程,并在状态和转变间施加双向重建一致性,从而解耦方向语义并促进连贯的轨迹建模。在纵向下游任务(包括疾病进展分类和进展描述)上的大量实验表明,ProTrans始终优于现有方法,为纵向CXR理解建立了统一的预训练框架。https://github.com/RPIDIAL/ProTrans

English abstract

Chest X-ray (CXR) interpretation often requires longitudinal comparison to assess disease progression. Existing approaches typically rely on temporal feature fusion or inter-study discrepancy modeling, yet remain limited in capturing subtle progression semantics and overlook the inherently directional nature of disease trajectories. In this paper, we propose ProTrans, a novel vision-language pretraining framework that formulates disease progression as a directional semantic transition between paired CXR studies. ProTrans leverages radiology reports to anchor individual CXR representations within interpretable disease states, and introduces a learnable progression feature map to explicitly encode semantic shifts between states, aligned with report-derived progression descriptions. To enforce direction-aware perception, ProTrans incorporates a reversed temporal modeling process and imposes bidirectional reconstruction consistency across states and transitions, thereby disentangling directional semantics and promoting coherent trajectory modeling. Extensive experiments on longitudinal downstream tasks, including disease progression classification and progression captioning, demonstrate that ProTrans consistently outperforms existing methods, establishing a unified pretraining framework for longitudinal CXR understanding. https://github.com/RPIDIAL/ProTrans

2606.15930 2026-06-16 cs.RO cs.AI New submission

ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

ControlMap: 用于交通场景仿真的可控高清地图生成

Marwan Farag, Steffen Wäldele, Yu Yao

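Affiliations * University of Stuttgart; Robert Bosch GmbH; Motional, Inc

AI summary Proposes a data-driven pipeline based on latent diffusion and ControlNet for controllable HD map generation, supporting spatial guidance, adjustable conditioning strength, and city-level style transfer, and introduces new metrics for adherence to the control signal and similarity to ground-truth maps.

Details
AI Chinese abstract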

仿真是验证自动驾驶系统的核心,但当前流程因高精(HD)地图创建成本高昂而受限于场景多样性不足。扩展HD地图需要昂贵的数据收集和人工处理。此外,现有生成模型缺乏在生成过程中针对特定道路拓扑进行细粒度控制的能力。本文提出一种数据驱动的可控HD地图生成管道,使用潜在扩散和ControlNet进行空间条件控制。据我们所知,我们是首个将空间引导信号注入扩散模型用于HD地图合成的工作。此外,我们的模型支持通过无分类器引导调整条件强度,并通过城市标签条件实现城市级风格迁移。为补充现有指标,我们引入两个新指标来评估对控制信号的遵循程度以及与真实地图的相似性。实验表明,我们的模型生成的HD地图真实且忠实遵循输入道路拓扑,同时准确保留城市特定细节。

English abstract

Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.

2606.15927 2026-06-16 cs.LG New submission

An Exploratory Study of Blood Glucose Estimation from Photoplethysmography Signals using Machine Learning

基于机器学习从光电容积脉搏波信号估计血糖的探索性研究

Ruhani Bhatia, Vijval Ekbote

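Affiliations * Indraprastha Institute of Information Technology, Delhi

AI summary Uses smartwatch PPG signals and CGM glucose data to build machine learning models and explore the feasibility of non-invasive blood glucose estimation; preliminary results suggest that some predictive signal may exist but that more data is needed to confirm it.

Comments 7 pages, 3 figures

Details
AI Chinese abstract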

糖尿病和极端血糖水平是当今人类面临的主要健康问题之一。虽然连续血糖监测(CGM)已成为管理糖尿病和监测血糖水平的有效技术,但该技术传统上是侵入性的(即需要刺穿皮肤),并存在刺激、硬结等风险。这凸显了对准确且可大规模部署的非侵入性CGM方法的需求。随着各种传感技术的出现及其在智能手表等可穿戴设备中的集成,我们现在能够以非侵入方式连续监测光电容积脉搏波(PPG)等身体信号。通过CGM连续监测血糖并通过智能手表连续监测PPG信号的能力,为我们提供了获取这两类密集数据的机会,从而开启了构建基于机器学习和深度学习的模型以从PPG信号估计血糖水平的可能性。在这项工作中,我们首先提供了一个配对数据集,包含来自智能手表的连续PPG信号以及使用CGM设备记录的血糖值。我们还展示了在数据集上进行的一些初步实验探索的结果。这些初步结果表明可能存在一些预测信号,但需要来自更多个体的更多数据进行进一步探索。数据集可在 https://zenodo.org/records/20577959 获取。

English abstract

Diabetes and extreme blood sugar levels are some of the major health problems faced by humans today across the world. While Continuous Glucose Monitoring (CGM) has emerged as an effective technology for management of diabetes as well as for monitoring blood sugar levels, this technology has traditionally been invasive (that is, requiring the piercing of the skin) and carries the risk of irritation, induration, etc. This highlights the need for accurate and non-invasive CGM methods that can be deployed at scale. With the emergence of various sensing technologies and their integration in wearables like the smart-watch, we now have the capability to continuously monitor body signals like the Photoplethysmogram (PPG) in a non-invasive manner. Having the ability to continuously monitor blood glucose through CGMs and continuously monitor PPG signals through a smart-watch offers an opportunity to get dense data on these two, opening the possibility of building machine learning and deep learning based models to estimate blood glucose level from PPG signals. In this work, we first present a paired dataset comprising continuous PPG signals from a smartwatch along with glucose values recorded using a CGM device. We also present the results of some preliminary experimental explorations performed on our dataset. These preliminary results suggest that some predictive signals may exist, though more exploration is needed with more data from a larger number of individuals. The dataset can be accessed at https://zenodo.org/records/20577959

2606.15924 2026-06-16 cs.CV cs.GR New submission

TurboGS: Accelerating 3D Gaussian Splatting via Error-Guided Sparse Pixel Sampling and Optimization

TurboGS: 通过误差引导的稀疏像素采样与优化加速3D高斯泼溅

Zheng Dong, Daifei Qiu, Pinxuan Dai, Ke Xu, Jiamin Xu, Lili He, Rynson W. H. Lau, Weiwei Xu

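Affiliations * University of Science and Technology of China

AI summary Proposes the TurboGS framework, which combines error-guided sparse pixel sampling, a structure-aware loss, dynamic density control, and a hybrid optimizer to reach up to 10x faster training while maintaining high-fidelity rendering quality.

Comments Accepted by ICML 2026. Project page: https://zhengdong.site/projects/TurboGS/

Details
AI Chinese abstract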

消费级应用需要快速优化3D高斯泼溅(3DGS)以实现高保真新视角渲染。然而,现有的3DGS加速方法在牺牲细节的同时,仍会在冗余像素上产生大量计算。本文提出TurboGS,一种误差引导的训练框架,通过将优化集中在感知信息丰富的像素上来加速3DGS。TurboGS基于四个核心组件:(1)瓦片级稀疏像素采样,由训练期间的多视图重建误差驱动,优先处理困难区域并跳过重建良好的区域以避免冗余梯度计算;(2)带有稀疏归一化互相关的瓦片级结构感知损失,提供稀疏但有效的监督以保留细节并稳定训练;(3)误差驱动的高斯密度控制策略,动态分配模型容量并移除冗余基元;(4)定制的混合优化器,将Hessian信息更新与Adam动量阻尼相结合,以稳定和改善稀疏监督下的收敛。标准基准实验表明,TurboGS在单个RTX 5090 GPU上可在100秒内提供与原始3DGS相当或更优的渲染质量(训练速度提升高达10倍)。

English abstract

Consumer-level applications require fast optimization of 3D Gaussian Splatting (3DGS) with high-fidelity novel view rendering. However, existing 3DGS acceleration approaches still incur substantial computation on redundant pixels while sacrificing fine details. In this paper, we present TurboGS, an error-guided training framework that accelerates 3DGS by concentrating optimization on perceptually informative pixels. TurboGS is built upon four core components: (1) a tile-wise sparse pixel sampling, which, driven by multi-view reconstruction errors during training, prioritizes challenging regions and skips well-reconstructed ones to avoid redundant gradient computation; (2) a tile-wise structure-aware loss with sparse Normalized Cross-Correlation, which provides sparse yet effective supervision to preserve fine details and stabilize training; (3) an error-driven Gaussian density control strategy, which dynamically allocates model capacity and removes redundant primitives; and (4) a tailored hybrid optimizer that couples Hessian-informed updates with Adam moment damping to stabilize and improve convergence under sparse supervision. Experiments on standard benchmarks demonstrate that TurboGS can deliver on par or superior rendering quality within 100 seconds on a single RTX 5090 GPU card (up to 10x training speedup over vanilla 3DGS).
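As a rough illustration of a sparse structure-aware term, the sketch below computes a zero-mean Normalized Cross-Correlation over only the sampled pixels of one tile. The function name, the flat-list tile layout, and the choice of the zero-mean NCC variant are assumptions; the abstract does not specify the exact formulation used in TurboGS.

```python
import math


def sparse_ncc(pred, target, indices, eps=1e-8):
    """Zero-mean NCC between two tiles, evaluated only at the sampled pixel indices."""
    p = [pred[i] for i in indices]
    t = [target[i] for i in indices]
    mean_p = sum(p) / len(p)
    mean_t = sum(t) / len(t)
    # Covariance-like numerator and the product of the two standard-deviation terms.
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p) *
                    sum((b - mean_t) ** 2 for b in t))
    return num / (den + eps)


def sparse_ncc_loss(pred, target, indices):
    """Loss that is near 0 when the sampled pixels share the same structure."""
    return 1.0 - sparse_ncc(pred, target, indices)


tile = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2]   # flattened tile intensities (toy values)
sampled = [0, 2, 3, 5]                  # pixel indices chosen by some sampler
brighter = [2 * v + 1 for v in tile]    # same structure, different gain and offset
inverted = [-v for v in tile]           # opposite structure
```

Because NCC is invariant to gain and offset, `brighter` scores close to 1 against `tile` while `inverted` scores close to -1, so the loss responds to structure rather than absolute intensity, even with only a few sampled pixels.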

2606.15920 2026-06-16 cs.CV New submission

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

OmniOPSD:面向情感计算的理性特权在线自蒸馏

Zebang Cheng, Shuimu Chen, Boxue Yang, Yuanshen Guan, Jingyi Chen, Zheng Lian, Xiaojiang Peng, Fei Ma, LaiZhong Cui, Qi Tian

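Affiliations * Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Tsinghua University; Shanghai Jiao Tong University; University of Science and Technology of China; Shenzhen Technology University; Tongji University; Huawei

AI summary To address reward sparsity for multimodal large language models on complex reasoning tasks, proposes the OmniOPSD framework, which uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets and provides dense token-level supervision through on-policy self-distillation, reaching a state-of-the-art average score of 84.19 on MER-UniBench.

Details
AI Chinese abstract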

多模态大语言模型的强化学习在复杂推理任务中常因严重的奖励稀疏性而受阻。这一挑战在涉及状态、情感、意图和行为的以人为中心的场景中尤为突出,其中异质多模态信号和主观人为因素使得高质量思维链标注昂贵且难以获取。尽管许多多模态数据集提供了专家标注的真实标签,但直接使用这些标签进行监督微调可能会鼓励多模态感知中的捷径学习,并为安全关键的人机交互提供有限的透明度。为解决这些限制,我们提出OmniOPSD,一种理性特权的在线自蒸馏框架,该框架将前沿模型生成的理性作为教师侧的特权证据而非学生模仿目标。OmniOPSD仅将前沿模型生成的证据感知理性作为训练时的特权证据上下文提供给本地教师。学生从原始多模态输入中采样自己的轨迹,而理性特权教师对相同令牌进行评分并提供密集的令牌级监督。因此,学生在自己的轨迹分布上学习,无需直接模仿前沿模型完成,且推理不需要标签、理性、思维链标注或闭源模型访问。在MER-UniBench上的实验表明,OmniOPSD以84.19的平均分实现了最先进的性能,消融实验进一步支持了理性特权教师指导的价值。

English abstract

Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human--AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of $84.19$, and ablations further support the value of rationale-privileged teacher guidance.

2606.15918 2026-06-16 cs.RO New submission

Energy-Efficient Arm Reaching for a Humanoid Robot via Deep Reinforcement Learning with Identified Power Models

基于识别功率模型的深度强化学习实现人形机器人节能手臂伸展

Nestor N. Deniz, Simon Parsons, Fernando Auat Cheein

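Affiliations * Harper Adams University; Lincoln Institute for Agri-Food Technology; Lincoln Centre for Autonomous Systems

AI summary Proposes an end-to-end energy-aware reinforcement learning framework that combines an experimentally identified electrical power model with a SAC policy for energy-efficient arm reaching on the Unitree G1 humanoid robot, with a 69.9% success rate in simulation and a mean energy of 71.5 J in physical validation.

Details
AI Chinese abstract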

在田间执行操作任务(如机器人苹果采摘)的人形机器人面临严重的能量约束,这直接限制了每块电池充电可执行的伸展运动次数。本文针对Unitree~G1人形机器人的7自由度左臂,提出了一种端到端的能量感知强化学习框架,该框架结合了基于物理实验识别的电功率模型和在基于Pinocchio的刚体动力学模拟器中训练的Soft Actor-Critic (SAC)策略。RL策略在增量关节位置动作空间上运行,并使用混合星座奖励进行训练,该奖励将四点末端执行器星座距离与扭矩范数能量代理相结合;经过$5\times10^6$次训练后,在运动学模拟中对$1\,000$个随机目标达到了$69.9\%$的成功率,成功情节的平均能量为\SI{98.16}{\joule}。最后,在物理Unitree~G1上,该策略在三个独立的10目标批次上进行了验证,实现了平均能量$71.5 \pm 48.3$\,J,末端执行器位置误差$2.64 \pm 1.04$\,cm,方向误差$6.92 \pm 1.33^\circ$——均在\SI{4}{\centi\metre}/$8.6^\circ$的训练容差内。这些结果构成了基于能量感知强化学习的人形机器人手臂伸展的第一步。

English abstract

Humanoid robots performing in-field manipulation tasks, such as robotic apple harvesting, face severe energy constraints that directly limit the number of reaching motions that can be executed per battery charge. This paper presents an end-to-end, energy-aware reinforcement learning framework for the 7-degree-of-freedom left arm of the Unitree G1 humanoid robot, combining a physics-based, experimentally identified electrical power model with a Soft Actor-Critic (SAC) policy trained in a Pinocchio-based rigid-body dynamics simulator. The RL policy operates on an incremental joint-position action space and is trained with a Hybrid Constellation Reward that combines a four-point end-effector constellation distance with a torque-norm energy proxy; after $5\times10^6$ training it reaches a $69.9\%$ success rate over $1\,000$ random targets in kinematic simulation, at a mean energy of 98.16 J on successful episodes. Finally, on the physical Unitree G1, the policy is validated over three independent 10-target batches, achieving a mean energy of $71.5 \pm 48.3$ J, an end-effector position error of $2.64 \pm 1.04$ cm, and an orientation error of $6.92 \pm 1.33^\circ$, within the 4 cm/$8.6^\circ$ training tolerance. These results constitute a first step toward energy-aware reinforcement-learning-based arm reaching for humanoid robots.

2606.15917 2026-06-16 cs.LG New submission

Reinforcement Learning for LLM-based Event Forecasting

基于强化学习的LLM事件预测

Amit Arnold Levy

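Affiliations * Advanced Computer Science; DeepSeek R1

AI summary Fine-tunes LLMs with GRPO, using a Wikipedia revisions tool to obtain current information, to forecast future events, bringing a 1.5B-parameter model to forecasting performance above Claude Sonnet 3.5.

Comments Submitted internally at the University of Oxford in Oct 2025, migrated to arXiv in Jun 2026

Details
AI Chinese abstract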

我们使用Group Relative Policy Optimization (GRPO),一种最近提出的样本和内存高效的强化学习方法,来微调预训练的LLM(参数范围1.5B到14B),使其能够通过Wikipedia修订工具或新闻摘要获取当前信息,从而预测超出LLM知识截止日期的真实事件,以及模拟训练动态不同方面的问题。我们利用这些实验结果来评论LLM在预测方面的扩展能力,并分类判断性预测如何适应可验证/不可验证的领域分类法,考虑预测未来事件时固有的偶然不确定性(例如掷骰子)的影响。通过GRPO训练,我们成功使一个1.5B参数的Transformer(Qwen 2.5 1.5B)在预测性能上超越了Claude Sonnet 3.5,以市场同意概率的交叉熵衡量。我们还讨论了达到这一结果过程中的各种死胡同。

English abstract

We use Group Relative Policy Optimization (GRPO), a recently devised sample and memory efficient reinforcement learning method, to finetune pretrained LLMs in the range of 1.5B to 14B parameters equipped with the ability to get current information through the use of a Wikipedia revisions tool, or news summaries, to forecast real events beyond the knowledge cutoff of the LLM, as well as problems made to simulate different aspects of the dynamics of that training. We use the results of these experiments to comment on the scaling capability of LLMs for forecasting, as well as classify how judgmental forecasting fits into the verifiable/unverifiable domain taxonomy, considering the impact of the inherent aleatoric uncertainty when forecasting future events (e.g. the roll of a die). As a result of the GRPO training, we manage to bring a 1.5B parameter transformer (Qwen 2.5 1.5B) to forecasting performance superior to Claude Sonnet 3.5 over the same dataset as measured by cross entropy from the market agreed probabilities. We also discuss various dead ends on the path to this result.
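Two ingredients mentioned above can be sketched in a few lines: cross entropy against a market-agreed probability, and the group-relative advantage that gives GRPO its name. Using negative cross entropy as the reward is an assumption made for this sketch (the abstract names cross entropy only as the evaluation measure), and the population standard deviation is one of several normalization choices.

```python
import math
from statistics import mean, pstdev


def cross_entropy(market_p, forecast_q, eps=1e-9):
    """Binary cross entropy of a forecast q measured against a market probability p."""
    q = min(max(forecast_q, eps), 1 - eps)  # clamp to avoid log(0)
    return -(market_p * math.log(q) + (1 - market_p) * math.log(1 - q))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


market_p = 0.7                 # hypothetical market-agreed probability
forecasts = [0.7, 0.5, 0.2]    # hypothetical forecasts sampled for the same question
rewards = [-cross_entropy(market_p, q) for q in forecasts]  # assumed reward design
advantages = group_relative_advantages(rewards)
```

Cross entropy against a fixed `market_p` is smallest when the forecast equals `market_p`, so the first forecast receives the largest advantage, and the advantages in a group sum to roughly zero because they are centered on the group mean.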

2606.15915 2026-06-16 cs.RO New submission

Identification of a Physics-Based Electrical Power Consumption Model for the Unitree G1 Humanoid Arm

基于物理的Unitree G1人形机器人手臂电力消耗模型识别

Nestor N. Deniz, Sebastian Vega, Simon Parsons, Fernando Auat Cheein

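Affiliations * Harper Adams University; Lincoln Institute for Agri-Food Technology; Lincoln Centre for Autonomous Systems

AI summary Proposes a physics-based, linear-in-parameters model for predicting the electrical power consumption of the Unitree G1 humanoid robot's left arm, with parameters identified from experimental data; it reaches R² = 0.933 across 897 trajectories and generalizes to trajectories at unseen speeds.

Details
AI Chinese abstract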

精确预测电力消耗对于电池供电人形机器人的能量感知运动规划、电池管理和热监测至关重要。本文提出了一个基于物理的线性参数模型,用于Unitree~G1人形机器人七自由度左臂的电力消耗。所提出的公式将执行器损耗项与基线扭矩校正相结合,该校正捕捉重力补偿负载的变化,并能够准确预测负净功率轨迹。引入成对交互项来模拟多关节同时运动期间的功率耦合。模型参数从物理Unitree~G1上收集的实验数据中识别,使用板载功率测量作为回归目标。在覆盖单关节和协调手臂运动、多个速度水平的897条轨迹上,识别模型实现了$R^2 = 0.933$,RMSE为1.07 (W)。在46条以先前未见速度执行的轨迹上验证,得到$R^2 = 0.965$,显示出在识别数据集之外的强泛化能力。对识别参数的分析揭示了手臂上不同的功耗特性,粘性摩擦主导大多数关节(肩部俯仰和所有三个腕关节),铜损主导肩部偏航和肘部,而肩部滚动则独特地由库仑摩擦主导。

English abstract

Accurate prediction of electrical power consumption is essential for energy-aware motion planning, battery management, and thermal monitoring in battery-powered humanoid robots. This letter presents a physics-based, linear-in-parameters model for the electrical power consumption of the seven-degree-of-freedom left arm of the Unitree G1 humanoid robot. The proposed formulation combines actuator loss terms with a baseline-torque correction that captures changes in gravity-compensation load and enables accurate prediction of negative net power trajectories. Pairwise interaction terms are introduced to model power coupling during simultaneous multi-joint motion. Model parameters are identified from experimental data collected on a physical Unitree G1 using onboard power measurements as the regression target. Across 897 trajectories covering single-joint and coordinated arm motions at multiple speed levels, the identified model achieves $R^2 = 0.933$ with an RMSE of 1.07 W. Validation on 46 trajectories executed at previously unseen speeds yields $R^2 = 0.965$, demonstrating strong generalisation beyond the identification dataset. Analysis of the identified parameters reveals distinct power-consumption characteristics across the arm, with viscous friction dominating most joints (shoulder pitch and all three wrist joints), copper losses dominating shoulder yaw and the elbow, and shoulder roll uniquely dominated by Coulomb friction.
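What "linear in parameters" means for identification can be shown with a single-joint sketch. The regressors below (torque squared for copper loss, velocity squared for viscous friction, absolute velocity for Coulomb friction, and torque times velocity for mechanical power) are textbook forms assumed for illustration; the paper's actual terms, including its baseline-torque correction and pairwise interaction terms, are not reproduced here.

```python
def features(torque, velocity):
    """Assumed regressors: copper loss, viscous friction, Coulomb friction, mechanical power."""
    return [torque ** 2, velocity ** 2, abs(velocity), torque * velocity]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(samples):
    """Least-squares fit via the normal equations; samples are (torque, velocity, power)."""
    rows = [features(t, w) for t, w, _ in samples]
    y = [p for _, _, p in samples]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(ata, aty)


def predict(theta, torque, velocity):
    """Predicted power is a dot product of parameters and regressors."""
    return sum(a * b for a, b in zip(theta, features(torque, velocity)))


def r_squared(y, y_hat):
    """Coefficient of determination between measurements and predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot
```

Because the power is linear in the unknown coefficients, ordinary least squares is enough to identify them from logged torque, velocity, and power samples, and the relative size of each fitted term indicates which loss mechanism dominates a joint.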

2606.15914 2026-06-16 cs.CL cs.HC New submission

Contaminated Collaboration: Measuring Gender Bias Transfer in LLM-Assisted Student Writing

污染的合作:测量LLM辅助学生写作中的性别偏见迁移

Ariyan Hossain, Kazi Kamruzzaman Rabbi, Farig Sadeque, S M Taiabul Haque

Affiliations * Brac University

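AI summary An experiment finds that using a gender-biased LLM writing assistant significantly increases gender stereotyping in students' career plan essays, and that the bias transfer is asymmetric: agency is suppressed in female-target essays, while male-target essays are largely unaffected.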

Comments 18 pages, 7 pages

Details
AI Chinese abstract

LLM中的性别偏见已在模型输出中得到广泛研究,有偏见的提示被证明会放大刻板印象生成。然而,这种偏见是否会传播到使用这些系统的人类所写的文本中,仍未被充分探索。我们研究了LLM写作助手中的性别偏见是否会迁移到学生撰写的职业规划作文中。我们首先验证了性别偏见的提示会诱导LLM生成性别差异化的语言,而中性提示则不会。然后,我们在受控环境中招募参与者(N = 123),在三种条件下为仅性别不同的配对传记档案撰写职业规划作文:无AI辅助、中性LLM辅助或性别偏见LLM辅助。与对照组和中性条件相比,偏见条件下的学生作文产生了显著更大的能动性差距和更多的性别刻板职业建议。我们的结果还揭示了这种偏见迁移是不对称的:女性目标作文中的能动性受到抑制,而男性目标写作基本不受影响。我们的发现凸显了AI辅助写作中偏见传播的风险,呼吁在教育AI工具中进行公平性意识设计。

English abstract

Gender bias in LLMs has been studied extensively in model outputs, with biased prompts shown to amplify stereotyped generations. Whether such bias propagates into text produced by humans who use these systems, however, remains underexplored. We investigate whether gender bias in an LLM writing assistant transfers into career plan essays written by students. We first verify that a gender-biased prompt induces gender-differentiated language in LLM-generated essays, while a neutral prompt does not. We then recruited participants (N = 123) in a controlled environment to write career plan essays for paired biographical profiles differing only in gender under three conditions: no AI assistance, neutral LLM assistance, or gender-biased LLM assistance. Students in the biased condition produced essays with a significantly larger agentic gap and more gender-stereotypic occupation suggestions than those in the control and neutral conditions. Our results also reveal that this bias transfer is asymmetric: agency is suppressed in female-target essays while male-target writing remains largely unaffected. Our findings highlight the risk of bias propagation in AI-assisted writing, calling for fairness-aware design in educational AI tools.