arXivDaily: arXiv daily academic digest, updated Monday to Friday
All subject categories: 2070
2606.12006 2026-06-11 cs.LG cs.AI New submission

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

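Affiliations: ADAPT Centre, Dublin City University; School of Computing, Dublin City University; Department of Computer Science and Engineering, University of Bologna

AI summary: Proposes a lightweight adaptation method that pairs tabular foundation models (TabPFN, TabDPT, TabICL) with a multi-task logistic regression head for clinical survival analysis, achieving competitive or superior performance on multiple benchmarks and ICU cohorts.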

Comments Accepted for publication at International Conference on AI in Healthcare 2026

Abstract

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.
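
The relative gains quoted in this abstract can be reproduced from the reported C-index values alone. The sketch below assumes that "relative improvement" means (new - baseline) / baseline, which the abstract does not state explicitly; the helper name `relative_gain` is illustrative and does not come from the paper.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (fine-tuned C-index, reference C-index), as reported in the abstract.
reported = {
    "MIMIC-IV vs DeepSurv": (0.856, 0.844),
    "MIMIC-IV vs best zero-shot": (0.856, 0.802),
    "eICU vs DeepSurv": (0.797, 0.784),
    "eICU vs best zero-shot": (0.797, 0.749),
}

gains = {name: round(relative_gain(new, base), 1) for name, (new, base) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

Under that assumption the four rounded values agree with the +1.4%, +6.7%, +1.7%, and +6.4% figures in the abstract.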

2606.12003 2026-06-11 cs.CL New submission

Agreement in Representation Space for Open-Ended Self-Consistency

Paula Ontalvilla, Gorka Azkune, Aitor Ormazabal

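Affiliations: HiTZ Center - Ixa, University of the Basque Country (UPV/EHU)

AI summary: For open-ended generation tasks, proposes Embedding-Based Agreement (EBA), which estimates self-consistency by clustering the embeddings of sampled generations and robustly selects more reliable outputs without any training.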

Abstract

Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs. In this work, we study self-consistency in open-ended generation tasks such as code synthesis and text summarization. We hypothesize that consistency can be understood as a geometric property of the generation space, where semantically compatible generations concentrate in similar regions of representation space. To study this hypothesis, we introduce Embedding-Based Agreement (EBA), a simple training-free operationalization that estimates agreement by clustering sampled generations in embedding space. Through experiments on mathematical reasoning, code generation, and summarization, we show that agreement in representation space provides a robust and scalable signal of self-consistency for open-ended tasks. In particular, EBA consistently outperforms random selection and exhibits more stable scaling behavior than recent selection approaches based on LLM evaluation or uncertainty estimation. We further show that these agreement signals remain stable across model families and embedding spaces, even with native hidden representations. Finally, our analysis shows that the geometric location occupied by sampled generations is strongly correlated with generation quality: generations concentrated near central regions of representation space tend to correspond to more reliable outputs, whereas peripheral generations are substantially less accurate. Overall, our findings support viewing self-consistency as a property of the geometric organization of sampled generations rather than exact symbolic overlap.
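
The idea of selecting an output by agreement in embedding space can be sketched in a few lines. The abstract does not say which clustering algorithm or selection rule EBA uses, so everything below is an assumption made for illustration: greedy threshold clustering on cosine similarity, followed by picking the member of the largest cluster that lies closest to the cluster centroid. The function name `select_by_agreement` and the `threshold` parameter are hypothetical, and the embeddings are assumed to be non-zero vectors.

```python
import numpy as np


def select_by_agreement(embeddings: np.ndarray, threshold: float = 0.8) -> int:
    """Return the index of one generation taken from the largest similarity cluster."""
    # Normalize rows so that dot products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T

    # Greedy leader clustering: a sample joins the first cluster whose
    # leader is similar enough, otherwise it starts a new cluster.
    leaders, members = [], []
    for i in range(len(X)):
        for c, lead in enumerate(leaders):
            if sim[i, lead] >= threshold:
                members[c].append(i)
                break
        else:
            leaders.append(i)
            members.append([i])

    # The largest cluster is treated as the "agreed" region.
    largest = max(members, key=len)
    # Inside it, prefer the member closest to the cluster centroid.
    centroid = X[largest].mean(axis=0)
    return largest[int(np.argmax(X[largest] @ centroid))]


# Toy example: three generations point roughly the same way, two are outliers.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.98, 0.1], [0.99, -0.05], [-1.0, 0.0]])
chosen = select_by_agreement(toy)
```

In the toy example the chosen index always comes from the three mutually similar vectors, which mirrors the abstract's observation that generations near central regions tend to be more reliable than peripheral ones.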

2606.11998 2026-06-11 cs.LG New submission

Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

Frank Xiao, Mary Phuong

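Affiliations: California Institute of Technology

AI summary: Proposes the bootstrapped monitoring protocol, which inserts an untrusted intermediate model with transparent chain-of-thought to oversee stronger agents, substantially improving catch rates on software engineering tasks even when the untrusted monitor colludes with the agent.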

Abstract

Trusted monitoring is a cornerstone of AI control. However, as frontier models grow more capable, the increasing capabilities gap between trusted and untrusted models may render trusted models unreliable monitors. We introduce bootstrapped monitoring, a protocol that addresses this by inserting a stronger, intermediate untrusted model with transparent chain-of-thought reasoning into the oversight chain. The untrusted monitor ($U_m$) evaluates the agent's actions, while a weaker trusted model ($T$) oversees $U_m$'s reasoning to detect collusion. We evaluate bootstrapped monitoring on multi-turn software engineering tasks (BashArena) across multiple agents and monitors. Bootstrapped monitoring substantially improves catch rates over trusted-only monitoring, even when the untrusted monitor actively colludes with the agent, provided we have access to its raw chain-of-thought. Our results suggest that bootstrapped monitoring can extend the useful lifetime of trusted models in control as AI capabilities advance.

2606.11989 2026-06-11 cs.CV New submission

From Nominal Intensity to Equivalent Rainfall: A Path-Based Credibility Evaluation Framework for Simulated Rainfall in Autonomous-Driving Perception Tests

Tian Xia, Xin Zhao, Shaolingfeng Ye, Junyi Chen

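Affiliations: College of Automotive and Energy Engineering, Tongji University; Tsinghua University

AI summary: Proposes a path-based credibility evaluation method that uses path-equivalent rainfall intensity, an uncertainty band, and a raindrop-distribution realism score, with a perception-consistency correction based on lidar point-cloud count and mean reflectivity, to align simulated rainfall with real rainfall and map test results to real-world scenarios.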

Comments 17 pages, preprint

Abstract

Credible simulated-rainfall conditions are essential for identifying perception-system boundaries and supporting SOTIF-oriented risk assessment in automated driving. However, closed-field tests are often described only by nominal rainfall intensity or single-point measurements, making it difficult to align simulated rain fields with real rainfall and map test results to real-world scenarios. This paper proposes a path-based credibility evaluation method for simulated rainfall in autonomous-driving perception tests. Using the drop size and velocity joint distribution of real rainfall as the reference, each candidate path is represented by path-equivalent rainfall intensity, an uncertainty band, and a path-averaged Realism of Raindrop Distribution (RRD) score. Lidar target point-cloud count and mean reflectivity are further used for perception-consistency correction, quantifying the proxy capability of each simulated-rainfall path for real-rainfall perception effects. Experiments are conducted using about 10,000 real-rainfall raindrop-spectrum samples, 728 RainSense perception samples, and 45 spatial sampling points in a 2.4 m x 7.2 m simulated-rainfall area. Results show that spatial non-uniformity remains under the same nominal condition, confirming the need for path-based evaluation. The method identifies Path IV and Path VI as preferable candidates, with results of 11.54 +/- 0.31 mm/h, RRD = 0.43, and 8.28 +/- 0.34 mm/h, RRD = 0.46, respectively. These paths show more balanced performance in rainfall-intensity stability, raindrop-spectrum realism, and perception consistency. The proposed method supports path selection, condition description, and credible interpretation of autonomous-driving perception tests under rainfall.

2606.11988 2026-06-11 cs.LG stat.ML New submission

What Uncertainties Do We Need for Dynamical Systems?

Yusuf Sale, Christopher Bülte, Felix Czaja, Joshua Stiller, Eyke Hüllermeier

Affiliations: Institute of Computer Science, LMU Munich; Munich Center for Machine Learning (MCML); Department of Mathematics, LMU Munich; German Research Center for Artificial Intelligence (DFKI, DSA)

AI summary: Examines uncertainty in dynamical systems from a machine learning perspective, distinguishing aleatoric from epistemic uncertainty and discussing how the objectives of representing and quantifying uncertainty differ across tasks.

Comments EIML@ICML

Abstract

The distinction between aleatoric and epistemic uncertainty has received considerable attention in machine learning research, mainly in the context of supervised learning but also in other settings such as generative modeling. In this paper, we offer a machine learning perspective on uncertainty modeling for dynamical systems, which has been studied much less so far. In particular, we ask: what uncertainties do we need for dynamical systems? We discuss sources of uncertainty, clarify their nature (aleatoric or epistemic), and consider how the objectives of representing and quantifying uncertainty vary across different tasks.

2606.11982 2026-06-11 cs.LG New submission

PAWS: Preference Learning with Advantage-Weighted Segments

Aleksandar Taranovic, Onur Celik, Niklas Freymuth, Ge Li, Serge Thilges, Huy Le, Tai Hoang, Rania Rayyes, Gerhard Neumann

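Affiliations: University of Freiburg

AI summary: Addresses the degraded temporal credit assignment caused by the mismatch between training and inference distributions in preference-based reinforcement learning; proposes PAWS, which performs policy updates directly with segment-level advantage functions and outperforms existing methods on robotic manipulation and locomotion tasks.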

Comments Published as a conference paper at ICML 2026

Abstract

Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose PAWS, a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.

2606.11977 2026-06-11 cs.CV New submission

ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction

LeKai Yu, Hao Liu, Kun Wang, Zhiran Li, Ruping Cao, Fan Liu, Yupeng Hu

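Affiliations: Shandong University; Southeast University

AI summary: Proposes the ParseFixer framework, which combines full-page backbone parsing with agentic selective correction and repairs high-value parsing errors through a verify-and-rollback mechanism, placing third in the document parsing track of the DataMFM Challenge.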

Abstract

In this report, we present our third-place solution for the DataMFM Challenge Track 1: Document Parsing. This track requires models to recover structured Markdown documents from document page images while preserving textual content and document structure. To address the complementary requirements of accurate content recovery and faithful structure reconstruction, we propose ParseFixer, an agentic framework for backbone parsing and selective correction. ParseFixer consists of two key modules: Full-Page Backbone Parsing (FBP) and Agentic Selective Correction (ASC). FBP produces stable initial Markdown outputs with MinerU2.5 Pro, while ASC detects high-value parsing failures and repairs them through a verify-and-rollback correction process. By placing selective multimodal correction after open-source backbone parsing, ParseFixer improves the recovery of key document elements without rewriting reliable backbone predictions. On the test set, our final system achieves an overall score of 61.78 and ranks third in Track 1, demonstrating its effectiveness for accurate document parsing. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ParseFixer.

2606.11969 2026-06-11 cs.CV New submission

SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation

Xu Zhang, Yu Lu, Ruijie Quan, Zhaozheng Chen, Bohan Wang, Yi Yang

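Affiliations: ReLER, College of Artificial Intelligence, Zhejiang University; Huawei Central Research Institute

AI summary: Proposes SpecLoR, a plug-and-play inference method that reduces spatiotemporal inconsistencies in text-to-video generation through lookahead prediction and frequency-domain rectification, substantially improving motion coherence on Wan2.2 at a cost of only 4 additional NFEs.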

Abstract

Flow Matching has enabled robust text-to-video generation via latent ODE sampling. However, velocity approximation and numerical discretization errors inevitably accumulate, causing sampling trajectories to drift. Consequently, generated videos often suffer from severe spatiotemporal inconsistencies. Nevertheless, directly correcting these drifted, noisy latents is challenging: (i) timestep-dependent noise obscures reliable structural cues; (ii) spatial interventions risk disrupting intricate local geometry while incurring heavy computational costs. To address this, we propose Spectral Lookahead Rectification (SpecLoR), a plug-and-play inference method that bypasses noise via lookahead prediction, and circumvents spatiotemporal entanglement by shifting corrections to the frequency domain, where universal statistical priors of natural videos are readily available. First, during early sampling stages, SpecLoR looks ahead to estimate the clean latent $z_{t,0}$ and computes its 3D spatiotemporal spectrum. Next, SpecLoR rectifies the amplitude spectrum to match the prior, leaving the phase intact. Finally, the corrected state is re-noised to resume ODE integration. Experiments on Wan2.2 demonstrate that SpecLoR significantly reduces physical artifacts and enhances motion coherence across multiple benchmarks with minimal computational overhead (4 additional NFEs).
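
The step "rectify the amplitude spectrum to match the prior, leaving the phase intact" can be illustrated with a plain NumPy FFT. This is only a sketch of that single step under stated assumptions: the latent is treated as one real (time, height, width) array, the prior is given as an amplitude array of the same shape, and the `strength` blending parameter is hypothetical. How SpecLoR obtains the lookahead estimate, builds its prior, and re-noises the corrected state is not described in the abstract and is not modeled here.

```python
import numpy as np


def rectify_amplitude(latent: np.ndarray, prior_amplitude: np.ndarray,
                      strength: float = 1.0) -> np.ndarray:
    """Move the 3D amplitude spectrum of `latent` toward `prior_amplitude`, keeping the phase."""
    spectrum = np.fft.fftn(latent, axes=(0, 1, 2))  # axes: (time, height, width)
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    # strength = 0 keeps the original amplitude, strength = 1 adopts the prior.
    blended = (1.0 - strength) * amplitude + strength * prior_amplitude
    corrected = blended * np.exp(1j * phase)
    # The inverse transform is real up to rounding when the prior amplitude
    # has the symmetry of a real signal's spectrum, so the tiny imaginary part is dropped.
    return np.fft.ifftn(corrected, axes=(0, 1, 2)).real


rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
# Stand-in prior: the amplitude spectrum of another real array.
prior = np.abs(np.fft.fftn(rng.standard_normal((4, 8, 8))))
rectified = rectify_amplitude(latent, prior)
```

Working on amplitude only is what lets the correction change global frequency statistics without moving structures around, since spatial and temporal layout is carried largely by the phase.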

2606.11968 2026-06-11 cs.LG stat.ML New submission

Efficient Multinomial Logistic Bandit via Frequent Directions

Linzhe He, Yu-Jie Zhang, Sifan Yang, Lijun Zhang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Paul G. Allen School of Computer Science & Engineering, University of Washington

AI summary: To address the high-dimensional computational bottleneck of multinomial logistic bandits, proposes EOFD-MLogB, which integrates frequent directions matrix sketching, reduces the per-round cost to O(Kd(m+K)^2) time and O(Kd(m+K)) space, and proves a regret bound close to that of the original algorithm.

Abstract

This paper studies efficient online algorithms for multinomial logistic bandits (MLogB), where the feedback distribution over $K+1$ outcomes follows a multinomial logistic model of $d$-dimensional action vectors. A representative UCB-type algorithm, OFUL-MLogB, achieves a regret bound of $\tilde{\mathcal{O}}(Kd\sqrt{T})$, but still requires $\mathcal{O}(K^3d^3)$ time and $\mathcal{O}(K^2d^2)$ space per round due to parameter estimation and optimistic reward construction, which is prohibitive in high-dimensional settings. To address this limitation, we propose EOFD-MLogB, which integrates frequent directions matrix sketching into OFUL-MLogB. By maintaining a low-rank SVD sketch of the accumulated Hessian, constrained online Newton updates in parameter estimation and $Kd \times K$ spectral-norm computations in the reward bonus are reduced to one-dimensional root-finding tasks and $K \times K$ eigenvalue computations, respectively. This yields dominant per-round time complexity $\mathcal{O}(Kd(m+K)^2)$ and space complexity $\mathcal{O}(Kd(m+K))$, where $m \ll d$ is the sketch size. We further prove a regret bound of $\tilde{\mathcal{O}}(\Delta_T(Kd\ln\Delta_T+m)\sqrt{T})$, where the sketching error factor $\Delta_T$ is controlled by the $m$-truncated spectral tail of the Hessian. Thus, when the Hessian is approximately low-rank, the regret is close to that of OFUL-MLogB. Experiments validate the computational efficiency and competitive performance.
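
For readers unfamiliar with the sketching primitive named in this abstract, here is a minimal version of the generic Frequent Directions streaming sketch. It is not EOFD-MLogB: the paper applies the idea to an accumulated Hessian inside a bandit algorithm, and those details are not in the abstract. The function name, the even-`ell` requirement, and the `ell <= d` restriction are simplifying assumptions of this sketch.

```python
import numpy as np


def frequent_directions(A: np.ndarray, ell: int) -> np.ndarray:
    """Stream the rows of A (n x d) into an ell x d sketch B with B.T @ B close to A.T @ A."""
    n, d = A.shape
    assert ell % 2 == 0 and ell <= d, "this sketch assumes an even ell with ell <= d"
    B = np.zeros((ell, d))
    next_free = 0  # index of the first all-zero row of B
    for row in A:
        if next_free == ell:
            # Sketch is full: subtract the squared singular value at index ell/2
            # from every squared singular value, clipping at zero.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt  # rows ell/2 .. ell-1 are now exactly zero
            next_free = ell // 2
        B[next_free] = row
        next_free += 1
    return B


rng = np.random.default_rng(0)
A = rng.standard_normal((300, 20))
B = frequent_directions(A, ell=8)
# Spectral-norm error of the covariance approximation.
error = np.linalg.norm(A.T @ A - B.T @ B, 2)
```

The sketch never overestimates: `A.T @ A - B.T @ B` stays positive semidefinite, and its spectral norm is at most `2 * ||A||_F^2 / ell`, so the error shrinks as the sketch size grows and is small when the spectrum decays quickly, which matches the low-rank condition mentioned in the abstract.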

2606.11966 2026-06-11 cs.CV New submission

Feature extraction for plant growth estimation

Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel

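Affiliations: Faculty of Engineering, North-West University; Centre for Artificial Intelligence Research; National Institute for Theoretical and Computational Sciences

AI summary: To meet the need for real-time plant growth stage estimation in precision agriculture, proposes two feature extraction methods (Gabor filters with morphological operations, and pre-trained CNNs with transfer learning) and tests them on a public dataset; the CNN approach beats hand-crafted features in both speed and accuracy, and the best system (VGG-19 features with an RBF SVM) reaches 98.4% accuracy at 0.08 seconds per image.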

Comments 13 pages

Journal ref Artificial Intelligence Research. SACAIR 2025. Communications in Computer and Information Science, vol 2784. Springer, Cham (2026)

Abstract

Precision agriculture requires the estimation of plant growth stages in real-time. When the plant growth stage is known, the wastage of resources in cultivation, such as nutrients and water, is reduced as only the required resources need to be supplied. Plants at different growth stages, however, have similar morphological features, which can make autonomous growth stage estimation difficult. This paper presents two feature extraction methods for growth stage estimation: one that uses a bank of Gabor filters and morphological operations, and the other that uses pre-trained convolutional neural networks (CNNs) and transfer learning. We test these methods on a publicly available plant growth stage dataset ("bccr-segset") for two species, canola and radish, grown and captured under indoor conditions. The two proposed feature extraction methods are compared, using support vector machines and boosted trees as classifiers. We find that both methods are suitable for real-time applications, and that CNN features outperform the hand-crafted features, both with regard to speed and accuracy. The best system (VGG-19 features, classified with a radial basis function support vector machine) obtained an accuracy of 98.4% for both species, processing an image in 0.08 seconds.

2606.11963 2026-06-11 cs.LG physics.comp-ph New submission

HAMNO: A Hierarchical Adaptive Multi-scale Neural Operator with Physics-Informed Learning for Dynamical Systems

Mostafa Bamdad, Mohammad Sadegh Eshaghi, Timon Rabczuk

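Affiliations: Bauhaus-Universität Weimar; Leibniz University Hannover

AI summary: Proposes the HAMNO neural operator architecture, which balances local and global information through an adaptive gating mechanism, together with the physics-informed extension PI-HAMNO, improving long-horizon prediction accuracy and physical consistency on non-periodic Allen-Cahn and related equations.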

Abstract

Neural operators provide a powerful framework for learning solution mappings of partial differential equations directly in function space. However, many existing architectures still struggle to represent nonlinear time-dependent systems that involve multi-scale structures, long-range interactions, and stable long-time evolution. In this work, we introduce the Hierarchical Adaptive Multi-scale Neural Operator (HAMNO), a neural-operator architecture that combines local convolutional representations, global spectral operators, and hierarchical encoder-decoder processing. The central component of HAMNO is a data-dependent gating mechanism that adaptively balances local and global information at each spatial location, allowing the model to resolve fine-scale features while preserving long-range dependencies. We further develop a physics-informed extension, PI-HAMNO, based on a multi-objective loss strategy that combines data fitting with strong- and weak-form physics constraints. The strong-form term penalizes the domain-integrated squared PDE residual in physical coordinates, while the weak-form term is constructed by multiplying the governing residual by finite-element test functions and evaluating the resulting element integrals using centroid-based tetrahedral quadrature. The framework is evaluated on non-periodic Allen-Cahn (AC), Cahn-Hilliard (CH), and Swift-Hohenberg (SH) equations defined on cubic domains. Across long-horizon rollout, data-limited training, out-of-distribution initial-condition shifts, and random-seed variations, HAMNO improves predictive accuracy over standard neural-operator baselines, while PI-HAMNO further enhances stability, physical consistency, and data efficiency. The implementation is publicly available at https://github.com/MBamdad/HAMNO .

2606.11961 2026-06-11 cs.LG cs.AI New submission

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

Antonio Pelusi, Stefano Braghin, Alberto Trombetta

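Affiliations: University of Insubria; IBM Research Ireland

AI summary: Studies the limits of in-context learning for structured data generation with large language models, finding that it cannot update the categorical prior inherited from pre-training, so rare classes are never generated; parameter-efficient fine-tuning resolves this but introduces memorization risk.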

Comments 9 pages, 5 figures. Empirical study of in-context learning and LoRA fine-tuning for synthetic tabular data generation, introducing the phenomenon of categorical prior lock-in. Under review

Abstract

Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term "categorical prior lock-in": the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.

2606.11953 2026-06-11 cs.CL New submission

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

Junyu Lu, Deyi Ji, Liqun Liu, Xiaokun Zhang, Youlin Wu, Roy Ka-Wei Lee, Peng Shu, Huan Yu, Jie Jiang, Bo Xu, Liang Yang, Hongfei Lin

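Affiliations: Dalian University of Technology; Tencent; City University of Hong Kong; Singapore University of Technology and Design

AI summary: Proposes the IARE framework, which achieves explainable hateful video detection through information augmentation and reasoning enhancement and reaches state-of-the-art performance on the Ex-HateMM and Ex-ImpliHateVid datasets.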

Abstract

Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal the implicit meanings behind these judgments, significantly undermining model explainability. To fill this gap, we aim to achieve explainable hateful video detection, enabling models to provide contextual rationales that integrate relevant evidence and logical reasoning alongside decisions. This approach can comprehensively enhance the understanding of video content and the explainability of the decision-making process. We first introduce two datasets, Ex-HateMM and Ex-ImpliHateVid, for explainable hateful video detection. Each dataset provides fine-grained annotations of multimodal harmful elements, along with contextual rationales. We then propose an Information Augmentation and Reasoning Enhancement (IARE) framework designed for explainable detection. The framework employs an information augmentation phase that leverages the multimodal chain-of-thought to integrate harmful elements, thereby enriching rationale evidence. Additionally, IARE incorporates a reasoning enhancement phase, in which Direct Preference Optimization guides the model toward correct reasoning paths and away from incorrect ones, thereby improving the logical coherence of its justifications. We conduct extensive experiments on the two datasets, comparing multiple baselines with our proposed IARE framework. The results demonstrate that IARE achieves state-of-the-art performance while also generating accurate rationales.

2606.11952 2026-06-11 cs.RO New submission

Deformable In-Hand Slip-Aware Tactile Sensor with Integrated Velocity, Force/Torque, and Pressure Map Sensing

Gabriel Arslan Waltersson, Yiannis Karayiannidis

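Affiliations: Chalmers University of Technology; Lund University

AI summary: Proposes a novel tactile sensor that integrates velocity, force/torque, and pressure map sensing in a deformable contact pad, enabling slip-aware control for in-hand manipulation and supporting rapid, low-cost fabrication.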

Abstract

This paper introduces a novel tactile sensor for in-hand manipulation with slip-aware control that integrates velocity, force/torque, and pressure map sensing into a single device with a deformable contact pad. To the best of our knowledge, this is the first sensor to combine these sensing modalities within a single compliant structure. The sensor features a deformable contact surface and can robustly track both flat and curved surfaces across a wide range of materials. Its performance is evaluated through a comprehensive set of experiments that highlight both its capabilities and limitations. The sensor is designed for rapid and low-cost fabrication using a combination of standard PCB manufacturing and rapid prototyping techniques.

2606.11949 2026-06-11 cs.LG cs.CR stat.ML New submission

Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

Jun Wen Leong

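Affiliations: University of California, Berkeley

AI summary: Proposes an online monitoring system that detects distribution shift with calibrated sequential statistics and restores the target error rate by adapting thresholds in a conformal abstention layer, achieving 86.6% valid detection across 800 experimental cells.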

Comments 16 pages, 4 figures, 7 tables. Code and data at https://github.com/junwenleong/safety-classifier-shift-monitor

英文摘要

We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8%]) with mean latency of 39.5 steps. Detection holds across three ground-truth regimes: synthetic onset (86.6%), real temporal jailbreaks (85%, 17/20), and GCG adversarial attacks. Weighted conformal prediction recovers up to 39 pp of lost coverage for DeBERTa (ESS=46/300) but collapses for all other classifiers (ESS~300): logistic density ratio estimation achieves perfect source/target separability in high-dimensional embedding spaces, clipping all importance weights to the floor. DeBERTa shows a gradient from effective correction (paraphrase, ESS=46) to near-total collapse (adversarial suffix, ESS=206). PCA to 32 dimensions breaks the collapse, recovering 33 pp for Llama Guard and 21 pp for ShieldGemma. Variance decomposition reveals classifier (eta^2=0.243), shift type (eta^2=0.237), and their interaction (eta^2=0.185) all contribute substantially to detection latency variance (all p<0.001), indicating per-classifier monitoring profiles are necessary.
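The counts quoted in this abstract can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the abstract does not state: that the 95% interval is a Wilson score interval (it happens to reproduce the reported bounds), and that ESS means Kish's effective sample size, (Σw)² / Σw², the usual definition for importance weights. The function names are illustrative only.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def kish_ess(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Factorial design: classifiers x shift conditions x seeds x window sizes.
cells = 4 * 5 * 20 * 2
low, high = wilson_interval(693, cells)
print(cells, round(low, 3), round(high, 3))
```

Under this definition, weights that are all equal give an ESS equal to the sample size, which is consistent with the reported ESS of about 300 out of 300 when every importance weight is clipped to the same floor: the weighting then carries no information about the shift.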

2606.11945 2026-06-11 cs.CL cs.IR 新提交

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

uva-irlab-conv 在 SemEval-2026 任务 8:基于学习型稀疏检索和列表式重排序的多轮 RAG

Simon Lupart, Kidist Amde Mekonnen, Zahra Abbasiantaeb, Mohammad Aliannejadi

发表机构 * University of Amsterdam(阿姆斯特丹大学)

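AI summary Proposes a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation for conversational systems across four domains, and handles unanswerable queries.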

Comments SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables

详情
AI中文摘要

本报告描述了我们在 SemEval-2026 任务 8(多轮检索与问答)中的参与情况。该任务评估跨四个领域(金融、云文档、政府、维基百科)的对话系统,并包括不可回答的查询,即可用集合中没有足够证据来生成完整回答。我们提出了一种多轮检索增强生成流水线,将学习型稀疏检索与基于 LLM 的重排序和生成相结合。使用稀疏检索作为主要检索方法,我们利用了其跨领域的强泛化能力。此外,我们利用 LLM 的长上下文能力进行对话查询重写、逐点和列表式重排序以及生成最终回答,每一步都基于完整的对话历史。这种多步骤设计使得在整个检索和生成过程中有效整合对话上下文,提高了跨领域的鲁棒性。

英文摘要

This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.

2606.11931 2026-06-11 cs.CL 新提交

Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model

低资源语言孟加拉语中书面答案的语义评分:使用微调轻量级语言模型

Meherun Farzana, Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan

发表机构 * Computer Science and Engineering, University of Dhaka(达卡大学计算机科学与工程系)

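AI summary For the low-resource language Bangla, proposes a bilingual evaluation system built on a fine-tuned lightweight language model that grades answers by semantic correctness rather than lexical overlap, and achieves the best results in both synthetic and human evaluation.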

Comments 10 pages, 5 figures, 2 tables. Preprint

详情
AI中文摘要

孟加拉语是世界上使用最广泛的语言之一,但在教育NLP研究中仍服务不足。在许多偏远和农村地区,合格学科教师资源有限,书面答案因此主要依靠人工评分,限制了及时和一致的反馈。自动评估具有挑战性,因为语义正确的回答在表面形式上可能有很大差异。我们提出一个为低资源教育环境设计的双语(孟加拉语-英语)评估系统,优先考虑语义正确性而非词汇重叠。我们的方法微调一个轻量级语言模型,使用问题、参考答案和学生答案对每个回答进行评分,产生一个数值分数和简洁、基于上下文的反馈,适合课堂部署。我们还构建了一个合成双语数据集,以实现受控训练和评估。在统一协议下评估的专有和开源LLM中,我们的QLoRA微调Qwen3-8B在合成评估中产生最具抗泄漏性的反馈(RoRa = 0.819),并在专门的人工研究中与人类评分的一致性最强(rho = 0.936, MAE = 0.725),证实了持续改进。

英文摘要

Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.

2606.11926 2026-06-11 cs.CL cs.AI 新提交

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

通过假设树精炼迈向通用自主研究

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Microsoft Research(微软研究院)

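AI summary Proposes Arbor, a framework that uses Hypothesis Tree Refinement (HTR) to run a long-horizon autonomous research loop; across six real research tasks it attains more than 2.5x the average relative held-out gain of Codex and Claude Code.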

详情
AI中文摘要

科学进步依赖于探索、实验和抽象的重复循环。研究人员测试候选方向,解释证据,并将所得经验用于后续尝试。我们研究AI代理如何自主地长期运行这一循环。我们提出了Arbor,一个用于自主研究的通用框架,它结合了长期存在的协调器、短期执行器和假设树精炼(HTR),后者是一个持久树,跨时间连接假设、工件、证据和提炼的见解。协调器管理树上的全局研究策略,而执行器在隔离的工作树中实现和测试单个假设。当结果返回时,Arbor更新树,传播可重用的经验,优化搜索前沿,并接受验证过的改进。这种设计将自主研究从一系列局部尝试转变为累积过程,其中策略、执行和证据跨时间传递。我们在自主优化(AO)下评估Arbor,这是一种操作设置,代理通过迭代实验改进初始研究工件,无需逐步人工监督。在模型训练、工具工程和数据合成等六项真实研究任务中,Arbor在所有六项任务上取得了最佳保留结果,在相同任务接口和资源预算下,平均相对保留增益是Codex和Claude Code的2.5倍以上。在MLE-Bench Lite上,Arbor使用GPT-5.5达到86.36%的任何奖牌,这是我们比较中的最强结果。

英文摘要

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
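To make the idea of a persistent tree that links hypotheses, evidence, and distilled insights more concrete, here is a minimal data-structure sketch. It is not Arbor's implementation: the class `HypothesisNode`, its fields, and the rule that a lesson is copied to every ancestor are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class HypothesisNode:
    """One node of a hypothetical hypothesis tree."""
    statement: str
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: List["HypothesisNode"] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    insights: List[str] = field(default_factory=list)
    score: Optional[float] = None  # held-out metric, set once tested

    def add_child(self, statement):
        child = HypothesisNode(statement, parent=self)
        self.children.append(child)
        return child

    def record(self, score, evidence, insight):
        """Store a result and propagate the lesson up to the root."""
        self.score = score
        self.evidence.append(evidence)
        node = self
        while node is not None:
            node.insights.append(insight)
            node = node.parent

    def frontier(self):
        """Return the untested leaves, i.e. the current search frontier."""
        if not self.children:
            return [self] if self.score is None else []
        leaves = []
        for child in self.children:
            leaves.extend(child.frontier())
        return leaves


root = HypothesisNode("improve held-out accuracy")
a = root.add_child("larger batch size")
b = root.add_child("stronger data augmentation")
a.record(0.71, "run-1 log", "larger batches did not help")
print([n.statement for n in root.frontier()])
```

In this toy version the coordinator would read `root.insights` and `root.frontier()` to decide what to try next, while each executor would call `record` on the node it tested.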

2606.11925 2026-06-11 cs.CV cs.LG 新提交

Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

通过LLM引导的视频拼接进行手语翻译的语料增强

Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey

发表机构 * Peter Pazmany Catholic University, Faculty of Information Technology and Bionics(彼得·帕兹马尼天主教大学信息科技与仿生学院) DeepSign Technologies Ltd.(DeepSign科技有限公司)

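AI summary Proposes a corpus augmentation method for sign language translation that needs no additional annotation or generative video model: CTC forced alignment extracts per-gloss clips, an LLM generates sentences, and the clips are stitched into synthetic videos. It gains +2.92 BLEU-4 within the GFSLT-VLP framework and finds that synthetic data harms vision-language pretraining even though it improves the pretraining objectives.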

详情
AI中文摘要

手语翻译(SLT)将手语视频转换为口语文本,对于改善无障碍交流以及促进手语与非手语社区之间的沟通具有重要前景。虽然大规模弱对齐数据集实现了规模化预训练,且无词汇表方法减少了对专家标注的依赖,但用于微调的高质量平行手语视频-文本对仍然稀缺,限制了长尾词汇和未见结构的泛化。我们提出一种语料增强方法,无需额外人工标注、外部手语视频语料库或生成式视频模型,仅依赖现有的词汇表标注训练语料和用于句子生成的LLM:通过CTC强制对齐从训练视频中提取每个手语词汇的片段,由语料锚定的LLM生成新的词汇-句子对,通过随机句子采样和片段分配组装合成序列。得到的合成RGB视频-文本对在下游训练阶段与架构无关,可直接被基于RGB的SLT模型使用,或通过从视频提取输入的流水线转换为姿态或特征表示。Sincan等人在严格相同条件下重新评估了五种近期无词汇表方法;在GFSLT-VLP基线上验证的最大增益仅为0.98 BLEU-4。我们的增强方法在同一框架内应用,无需改变架构或训练协议,实现了+2.92 BLEU-4。我们进一步发现,合成数据虽然改善了视觉-语言预训练的目标,但对其有害;并且基于L2准则优化片段过渡以实现视觉平滑适得其反;我们提出,突兀的边界可能作为一种隐式正则化形式。代码可在https://github.com/robizso/slt-datagen获取。

英文摘要

Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at https://github.com/robizso/slt-datagen.
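The assembly step described above (one extracted clip per gloss, chosen at random, then concatenated) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `clip_bank` stands in for the output of CTC forced alignment, the gloss names, video IDs, and timestamps are made up, and the LLM sentence-generation step is omitted.

```python
import random


def stitch_sequence(glosses, clip_bank, rng):
    """Pick one extracted clip per gloss and return them in sentence order."""
    return [rng.choice(clip_bank[gloss]) for gloss in glosses]


# Hypothetical clip bank: gloss -> list of (video_id, start_sec, end_sec).
clip_bank = {
    "WEATHER": [("vid01", 0.0, 0.6), ("vid07", 1.2, 1.9)],
    "TOMORROW": [("vid03", 0.4, 1.0)],
    "RAIN": [("vid01", 0.6, 1.3), ("vid05", 2.0, 2.8)],
}

rng = random.Random(0)  # fixed seed so the sampling is repeatable
sequence = stitch_sequence(["TOMORROW", "WEATHER", "RAIN"], clip_bank, rng)
for video_id, start, end in sequence:
    print(video_id, start, end)
```

The clips are joined with hard cuts here, in line with the abstract's observation that smoothing clip transitions under L2-based criteria was counter-productive.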

2606.11922 2026-06-11 cs.SD cs.AI 新提交

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

Lung-SRAD: 基于谱感知正则化音频DASS与双轴补丁混合对比学习的呼吸音分类

Hemansh Shridhar, Miika Toikkanen, June-Woo Kim

发表机构 * RSC LAB, MODULABS(RSC实验室,MODULABS) Department of Electronic Engineering, Wonkwang University(圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University(圆光大学人工智能融合研究所)

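AI summary To address the AST model's reduced sensitivity to localized abnormal patterns in respiratory sound classification, proposes spectral-aware layer regularization and Dual-Axis Patch-Mix contrastive learning on a state space model backbone, reaching a score of 64.48% on the ICBHI benchmark, 5% above the AST baseline.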

Comments Accepted to Interspeech 2026

详情
AI中文摘要

最近的呼吸音分类(RSC)研究主要依赖于CLS令牌驱动的自注意力架构,如音频频谱图变换器(AST)。虽然它在建模全局上下文方面有效,但最近的分析表明存在低通滤波行为,可能会降低对局部异常模式的敏感性。在这项工作中,我们研究了状态空间模型(SSM)作为RSC的替代骨干网络。使用蒸馏音频状态空间模型,我们通过频谱响应曲线分析中间表示,并观察到对中到高空间频率分量的更强保留。基于这些观察,我们引入了使用高斯卷积应用于选定层的谱感知层正则化。我们进一步提出了针对基于SSM的音频模型定制的双轴补丁混合对比学习,以实现稳健的表示学习。在ICBHI基准上的实验表明,我们的方法达到了64.48%的分数,比AST基线高出5%。代码可在以下网址获取:https://github.com/RSC-Toolkit/Lung-SRAD。

英文摘要

Recent respiratory sound classification (RSC) studies largely rely on CLS-token-driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest that these architectures exhibit a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at https://github.com/RSC-Toolkit/Lung-SRAD.

2606.11918 2026-06-11 cs.AI 新提交

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

提问的艺术:一致性增强空间推理中的事实性

Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas

发表机构 * The University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) University of Oxford(牛津大学) Stanford University(斯坦福大学)

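AI summary Proposes a self-supervised reinforcement learning framework that aligns the spatial reasoning already present in pretrained models through geometric and semantic consistency verifiers (for example, image flipping and swapping the order of objects in the text), approaching the accuracy of supervised methods without labeled data.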

详情
AI中文摘要

当前的大型推理模型(LRMs)展现出显著的通用能力,但在空间推理任务中表现明显不足。现有方法将此差距视为知识缺陷,依赖监督微调(SFT)从外部视觉源或合成引擎中获取标注空间数据。相反,我们认为对于许多任务,空间推理能力已经存在于预训练的LRMs中,但需要通过几何2D和3D约束下的逻辑一致性进行对齐。在这项工作中,我们提出了一个自监督强化学习(RL)框架,针对内部推理过程,无需真实标注。通过形式化一致性验证器——即在变换下检查几何和语义一致性的奖励函数——我们证明模型可以提高其空间推理能力。我们同时使用图像变换(如翻转)和文本变换(如交换问题中对象的顺序),并提出了一种新的基于最优传输的RL策略OT-GRPO,这是针对成对验证器定制的组相对策略优化的最小匹配变体。我们展示了这种无标签一致性训练在精度上接近使用真实监督训练的模型,并在不同任务和数据领域实现了类似的泛化。

英文摘要

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
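A consistency verifier of the kind described here can be written as a reward that compares two model answers and never consults a ground-truth label. The sketch below is an assumption-laden toy: the answer vocabulary, both function names, and the binary 0/1 reward are illustrative, and the OT-GRPO matching step is not shown.

```python
# Directions that swap when an image is mirrored horizontally.
FLIP = {"left": "right", "right": "left"}


def flip_consistency_reward(answer_original, answer_flipped):
    """Reward 1.0 if the answer on the mirrored image mirrors the original."""
    expected = FLIP.get(answer_original, answer_original)
    return 1.0 if answer_flipped == expected else 0.0


def swap_consistency_reward(answer_ab, answer_ba):
    """For "Is A left of B?" vs "Is B left of A?", answers should disagree."""
    return 1.0 if {answer_ab, answer_ba} == {"yes", "no"} else 0.0


print(flip_consistency_reward("left", "right"))   # consistent pair
print(flip_consistency_reward("left", "left"))    # inconsistent pair
print(swap_consistency_reward("yes", "no"))       # consistent pair
```

Both rewards can be satisfied by a model that is consistently wrong, which is why the abstract frames them as alignment signals for capabilities the pretrained model already has rather than as a source of new knowledge.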

2606.11915 2026-06-11 cs.SD cs.AI 新提交

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

呼吸音分类的质量自适应角度边界学习

Yoon Tae Kim, Heejoon Koo, Miika Toikkanen, June-Woo Kim

发表机构 * RSC LAB, MODULABS, Republic of Korea(RSC实验室,MODULABS,韩国) Department of Electronic Engineering, Wonkwang University, Republic of Korea(韩国圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University, Republic of Korea(韩国圆光大学人工智能融合研究所)

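AI summary Proposes QLung, a quality-adaptive angular-margin learning framework that derives a no-reference audio quality margin from spectral entropy and root-mean-square energy and adaptively scales angular margins to improve feature generalization; it improves in-distribution performance on ICBHI by 2.46% and achieves the strongest out-of-distribution performance on SPRSound.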

Comments Accepted to Interspeech 2026

详情
AI中文摘要

我们提出了一种质量自适应角度边界学习框架,通过增强类内紧凑性和类间可分离性来改进特征泛化。我们的框架名为QLung,引入了基于频谱熵和均方根能量的无参考音频质量边界,根据录音质量自适应缩放角度边界。为此,我们提出了一种对数缩放的角度边界,在严重类别不平衡下稳定训练。我们还使用了一个角度分类器,对特征和类别权重进行归一化,确保在单位超球面上一致地应用边界惩罚。我们的方法在ICBHI数据集上比交叉熵基线提高了2.46%的分布内性能,最重要的是,在SPRSound数据集上,与先前最先进的方法相比,实现了最强的分布外性能。代码可在以下网址获取:https://github.com/RSC-Toolkit/QLung。

英文摘要

We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at https://github.com/RSC-Toolkit/QLung.

2606.11913 2026-06-11 cs.CV 新提交

From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

从内容到知识:基于神经知识表示的闪电般快速长视频理解

Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu

发表机构 * University of Science and Technology of China(中国科学技术大学)

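AI summary Proposes encoding a long video as a Neural Knowledge Representation (NKR): Agentic Knowledge Distillation (AKD) automatically synthesizes descriptions and question-answer pairs and embeds the video's knowledge in a small set of weights attached to the VLM backbone, giving lightweight, reusable video understanding that needs no video reloading at inference and sharply reduces latency.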

详情
AI中文摘要

我们提出了一种新的长视频理解范式,将长视频视为神经知识表示(NKR)。NKR既不将视频内容表示为标记流,也不表示为预组织的数据库,而是作为附加到VLM骨干网络的一小部分网络权重。通过一种新颖的智能体知识蒸馏(AKD)过程优化NKR权重,以封装视频的语义内容,其中智能体自动合成密集描述和问答对,将视频知识蒸馏到NKR中。虽然AKD作为一次性的全面编码阶段,但生成的NKR将视频转换为可移植、可重用的资产。在推理时,轻量级NKR被挂载到冻结的视觉语言模型(VLM)上,实现直接的、基于查询的理解,无需重新加载或重新编码原始视频。这种方法将视频长度与推理成本解耦,为多轮视频理解提供了高摊销效率。在LVBench基准上的实验表明,我们的方法在实现与最先进方法相当的性能的同时,将端到端延迟降低了两个数量级以上,为交互式长视频理解开辟了新的可能性。

英文摘要

We propose a new paradigm for long-video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor as a pre-organized database, but as a small, separate set of network weights attached to the VLM backbone. The NKR weights are optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.

2606.11910 2026-06-11 cs.CL 新提交

An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination

一种本体引导的多锚点图检索框架用于交通事故法律责任判定

Xu Li, Shuqi Tian, Xun Han, Kuncheng Zhao, Xinyi Li

发表机构 * Southwest Petroleum University(西南石油大学) Sichuan Police College(四川警察学院)

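AI summary Proposes OMAGR, an ontology-guided framework that decomposes queries into anchors and runs parallel graph retrieval to resolve the multi-dimensional retrieval bottleneck, improving Context Precision and Faithfulness on the TrafficLaw-QA dataset.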

Comments Submitted to ICONIP. 15 pages, 3 figures

详情
AI中文摘要

交通事故法律责任判定对于分配法律处罚至关重要,需要同时识别跨多个法律维度的相互依赖的法定条款。然而,现有的检索增强生成方法存在多维度检索瓶颈:单轴架构将复杂的法律查询压缩为单一通路,导致相互依赖的法定维度被忽视。为了解决这个问题,我们提出了OMAGR,一个本体引导的框架,将查询分解为与本体对齐的锚点,并在每个维度上执行并行图检索,确保在融合前各维度独立检索。为了评估所提出的方法,我们创建了TrafficLaw-QA数据集,这是一个经过专家验证的基准数据集,包含200个问题和527条法律条款。结果表明,TrafficOmni-RAG在上下文精度和忠实度指标上优于基线。研究结果表明,并行多锚点检索有效解决了多维度检索瓶颈,为交通事故法律责任判定研究提供了有前景的方向。

英文摘要

Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented generation methods suffer from a multi-dimensional retrieval bottleneck: single axis architectures compress complex legal queries into a single pathway, causing interdependent statutory dimensions to be overlooked. To address this, we propose OMAGR, an ontology-guided framework that decomposes queries into ontology-aligned anchors and executes parallel graph retrieval across each dimension, ensuring independent retrieval across dimensions before fusion. To evaluate the proposed method, we created the TrafficLaw-QA dataset, an expert-validated benchmark dataset containing 200 questions and 527 legal provisions. Results show that TrafficOmni-RAG outperforms baselines on Context Precision and Faithfulness metrics. The findings demonstrate that parallel multi-anchor retrieval effectively resolves the multi-dimensional retrieval bottleneck, offering a promising direction for traffic law liability determination research.

2606.11909 2026-06-11 cs.AI 新提交

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Embodied-BenchClaw:用于具身空间智能基准构建的自主多智能体系统

Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma

发表机构 * QiYuan Lab(启元实验室) School of Information and Software Engineering, University of Electronic Science and Technology of China(电子科技大学信息与软件工程学院) Beijing University of Posts and Telecommunications(北京邮电大学) School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) School of Computer Science and Engineering, Beihang University(北京航空航天大学计算机科学与工程学院)

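AI summary Proposes Embodied-BenchClaw, an autonomous system coordinated by three agents through a five-stage pipeline that automatically constructs verifiable, executable, maintainable, and diagnostically useful embodied spatial intelligence benchmarks with reduced manual effort.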

详情
AI中文摘要

基准测试对于评估具身空间智能至关重要,但其构建劳动密集、难以重用且维护困难。现有的具身基准通常是静态的,随着模型改进可能迅速饱和,限制其区分新能力的能力。我们提出Embodied-BenchClaw,一个用于构建具身空间智能基准的自主智能体系统。给定用户指定的评估意图,Embodied-BenchClaw通过五个阶段流水线自动生成完整且可持续更新的基准包:意图蓝图、数据收集、结构化与清洗、基准合成、评估报告。该流水线由三个智能体协调:规划、构建和评估。为提高可重用性和可靠性,Embodied-BenchClaw引入了可扩展的技能库和过程质量控制,使基准构建可组合、可验证和可修复。我们实例化了多个基准,涵盖室内空间推理、室外空间推理、机器人操作、四足机器人导航、无人机/空中视图理解以及静态基准增强。这些基准跨越不同的具身载体、数据源和空间能力。通过人工评估、基于评判者的评估、一致性检查、成本分析和消融实验,结果表明Embodied-BenchClaw能够以较少的人工努力构建可验证、可执行、可维护且诊断有用的具身空间基准。

英文摘要

Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

2606.11906 2026-06-11 cs.CL 新提交

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

语言何时重要?多语言指令揭示视觉-语言-动作模型中的逐步语言敏感性

Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

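AI summary Translates the LIBERO benchmark into ten languages for the first systematic evaluation of multilingual robustness in VLA models, finds that success rates drop by 30-50% under non-English instructions, and proposes an inference-time alignment intervention based on step-level language sensitivity that substantially improves performance.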

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

视觉-语言-动作(VLA)模型在语言条件机器人操作中表现出强大性能,但其对语言变化的鲁棒性仍知之甚少。在这项工作中,我们通过将LIBERO基准翻译成十种语言,首次对VLA模型进行了系统的多语言评估,揭示了在非英语指令下性能严重下降,成功率下降30-50%。通过对任务执行的细粒度分析,我们发现语言影响在步骤间高度不均匀:某些步骤表现出强烈的语言依赖性并主导整体任务失败,而其他步骤则基本与语言无关。基于这一见解,我们提出了一种逐步推理时干预方法,根据步骤语言敏感性对齐表示,显著提高了语言变化下的性能。我们的结果表明,VLA模型中的语言鲁棒性本质上是一个逐步控制问题,突出了时间结构化分析对于可靠具身智能体的重要性。

英文摘要

Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translating the LIBERO benchmark into ten languages, revealing severe performance degradation under non-English instructions, with success rates dropping by 30-50%. Through fine-grained analysis of task executions, we find that language influence is highly non-uniform across steps: certain steps exhibit strong language dependence and dominate overall task failure, while others are largely language-agnostic. Based on this insight, we propose a step-wise inference-time intervention that aligns representations according to step language sensitivity, substantially improving performance under linguistic variation. Our results indicate that language robustness in VLA models is fundamentally a step-wise control problem, highlighting the importance of temporally structured analysis for reliable embodied agents.

2606.11903 2026-06-11 cs.SD 新提交

Snapping Matters: Context-Aware Onset Refinement for Automatic Music Transcription

Snapping Matters: 上下文感知的起始点细化用于自动音乐转录

Abhirup Saha, Hans-Ulrich Berendes, Meinard Müller, Ben Maman

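AI summary For weakly aligned score-audio data, proposes a context-aware onset refinement method based on bipartite graph matching that improves onset alignment and transcription accuracy in automatic music transcription.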

Comments Published in International Computer Music Conference (ICMC) 2026

详情
AI中文摘要

精确的音符级标注对于训练自动音乐转录(AMT)系统至关重要,尤其是音符起始点标签,它是许多现代AMT系统的核心组成部分。然而,真实世界录音的高质量标注非常稀缺。序列级乐谱-音频对齐方法(如动态时间规整)仅提供粗略对应,因此需要局部细化步骤。这个细化步骤称为snapping,它使用神经起始点后验图的峰值来调整对齐的乐谱起始点,并且通常决定了弱对齐的乐谱-音频对是否能够成为可用的训练数据。尽管具有实际重要性,snapping通常被视为简单的后处理启发式方法,并通过贪婪的局部决策实现。我们提出了用于训练乐器无关转录器的snapping策略的系统分析,证明了snapping对于从弱对齐数据学习至关重要。在此基础上,我们将snapping形式化为每个音高的分配问题,并通过二分图匹配解决,从而在重叠的细化窗口和不确定的初始对齐下做出上下文感知的起始点决策。在钢琴、室内乐和管弦乐录音上的广泛跨数据集实验表明,与贪婪snapping相比,起始点对齐和转录精度有所提高,并且随着snapping窗口变宽和初始对齐变粗糙,增益增加。定性示例见我们的项目页面:https://abhirupsaha8.github.io

英文摘要

Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level score--audio alignment methods such as dynamic time warping provide only coarse correspondence, making a local refinement step necessary. This refinement step, known as snapping, adjusts aligned score onsets using peaks in a neural onset posteriorgram and often determines whether weakly aligned score--audio pairs become usable training data at all. Despite its practical importance, snapping is typically treated as a simple post-processing heuristic and implemented with greedy local decisions. We present a systematic analysis of snapping strategies for training instrument-agnostic transcribers, demonstrating that snapping is essential for learning from weakly aligned data. Building on this, we formulate snapping as a per-pitch assignment problem and solve it via bipartite graph matching, yielding context-aware onset decisions under overlapping refinement windows and uncertain initial alignments. Extensive cross-dataset experiments across piano, chamber, and orchestral recordings show improved onset alignment and transcription accuracy over greedy snapping, with gains increasing for wider snapping windows and coarser initial alignments. Qualitative examples are provided on our project page: https://abhirupsaha8.github.io
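The difference between greedy snapping and snapping as an assignment problem shows up whenever refinement windows overlap. The sketch below handles the onsets of a single pitch and is only an illustration: the onset and peak times are invented, the cost is plain time distance, and the exhaustive search is a stand-in for a proper bipartite matching solver (such as the Hungarian algorithm), usable only for tiny inputs.

```python
from itertools import permutations


def greedy_snap(onsets, peaks, window):
    """Each onset in turn grabs its nearest free peak within the window."""
    free = list(peaks)
    snapped = []
    for t in onsets:
        candidates = [p for p in free if abs(p - t) <= window]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - t))
            free.remove(best)
            snapped.append(best)
        else:
            snapped.append(None)  # no peak left in range
    return snapped


def matched_snap(onsets, peaks, window):
    """Exhaustive assignment: match as many onsets as possible, then
    minimize the total time shift. None marks an unmatched onset."""
    slots = list(peaks) + [None] * len(onsets)
    best, best_key = None, None
    for perm in permutations(slots, len(onsets)):
        pairs = list(zip(onsets, perm))
        if any(p is not None and abs(p - t) > window for t, p in pairs):
            continue
        matched = sum(p is not None for _, p in pairs)
        cost = sum(abs(p - t) for t, p in pairs if p is not None)
        key = (-matched, cost)
        if best_key is None or key < best_key:
            best, best_key = list(perm), key
    return best


onsets = [1.00, 1.10]   # DTW-aligned score onsets (seconds)
peaks = [0.80, 1.06]    # peaks of the onset posteriorgram
print(greedy_snap(onsets, peaks, window=0.25))
print(matched_snap(onsets, peaks, window=0.25))
```

In this example the greedy pass lets the first onset take the peak at 1.06, leaving the second onset with no peak inside its window, whereas the joint assignment matches both onsets.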

2606.11901 2026-06-11 cs.RO cs.AI 新提交

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

DuoBench: 一个可复现的双手操作基准,涵盖仿真与现实世界

Tobias Jülg, Seongjin Bien, Simon Hilber, Yannik Blei, Pierre Krack, Maximilian Li, Sven Parusel, Rudolf Lioutikov, Florian Walter, Wolfram Burgard

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Franka Robotics; Technical University of Munich(慕尼黑工业大学)

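AI summary Proposes DuoBench, a bimanual manipulation benchmark framework on the FR3 Duo platform with 11 tasks and a stage-based evaluation scheme, used to diagnose where current policies fail in bimanual coordination and in transfer between simulation and the real world.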

详情
AI中文摘要

双手机器人系统极大地扩展了操作能力,但协调两只手臂引入了额外的控制复杂性和故障模式,现有基准未能很好地捕捉这些。我们介绍了DuoBench,一个针对FR3 Duo平台上的双手操作策略的可扩展基准框架。DuoBench包含跨越四个协调类别的十一个任务,在仿真中实现,并通过可复现的任务配方和3D打印资产部分地在现实世界中复现。此外,我们提出了一种基于阶段的评估方案,支持超出二元成功之外的细粒度语义故障分析,并为所有基准任务提供人类遥操作数据集。我们在仿真和真实硬件上对几种双臂模仿学习和视觉-语言-动作策略进行了基准测试。我们的结果表明,当前策略在双手操作中仍然面临挑战,特别是在早期交互阶段、并行手臂执行以及仿真与现实环境之间的迁移方面。DuoBench为诊断这些故障模式和研究未来的双臂策略学习方法提供了一个可复现的测试平台。代码、数据集和视频可在https://duobench.github.io/获取。

英文摘要

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at https://duobench.github.io/

2606.11897 2026-06-11 cs.CL 新提交

Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Notes2Skills: 从实验室笔记本到具有确定性意识的科学智能体技能

Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang

发表机构 * Southern University of Science and Technology(南方科技大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University College Dublin(都柏林大学学院)

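AI summary Proposes Notes2Skills, a framework that turns lab notes into verifiable skills for scientific AI agents while preserving the author's certainty, addressing the conflation of uncertain judgments with confirmed conclusions.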

Comments 28 pages, preprint

详情
AI中文摘要

科学发现工作流程通常包含并严重依赖实验室笔记,研究人员在其中记录观察结果、解释不确定的结果并规划后续实验。这些信息丰富的实验室笔记保留了不断演变的科学推理和作者的不确定性,而不是出版物中展示的经过修饰的最终结果,为人工智能在更全面和更深层次上参与科学探索提供了宝贵机会。然而,大多数先前关于科学文本的工作集中在论文、协议或结构化数据库上,使得非正式的实验室笔记作为科学AI智能体的输入未被充分探索。这一差距很重要,因为实验室笔记通常在同一段落中混合了经过验证的观察结果、初步判断和可能的实验下一步。如果这些信号被混淆,AI智能体可能会将不确定的科学判断误认为是已确认的结论或可执行的行动。为此,我们提出了Notes2Skills,一个两阶段框架,用于将实验室笔记本转化为可验证的科学AI智能体技能,同时保留作者的不确定性。在七个条件和三个湿实验环节中,Notes2Skills是唯一既不会将不确定的笔记误认为是明确的指令,也不会丢弃明确指令的配置。我们表明,确定性保留是实验室笔记本与可靠智能体技能之间缺失的一环,为更安全的AI共同科学家系统开辟了一条道路。

英文摘要

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than polished final results exhibited in publications, providing a valuable opportunity for AI to engage in scientific exploration at a more comprehensive and deeper level. However, most prior work on scientific text focuses on papers, protocols, or structured databases, leaving informal laboratory notes underexplored as inputs to AI agents for science. This gap matters because lab notes often intermingle validated observations, tentative judgments, and possible experimental next steps within the same passage. If these signals are conflated, an AI agent may mistake uncertain scientific judgments for confirmed conclusions or executable actions. To this end, we present Notes2Skills, a two-stage framework for turning lab notebooks into verifiable skills for scientific AI agents while preserving the author's certainty. Across seven conditions and three wet-lab sessions, Notes2Skills is the only configuration that neither mistakes uncertain notes for firm instructions nor discards firm ones. We show that certainty preservation is the missing piece between lab notebooks and reliable agent skills, opening a path toward safer AI co-scientist systems.

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC New submission

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐:基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

Affiliations * State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学通用人工智能国家重点实验室、智能科学与技术学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理与认知科学系) Microsoft Research Asia(微软亚洲研究院)

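AI summary Studies how fMRI signals can enhance the reasoning of large language models and proposes a brain-guided framework that achieves up to a 13% accuracy gain across 10 models.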

AI Chinese abstract

大型语言模型(LLMs)与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的,一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐,以及这些信号是否能够改进它们。在此,我们聚焦于演绎推理,表明LLM内部表征不仅与任务fMRI活动部分对齐,而且可以直接通过这些信号增强。使用神经预测性度量,我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分,而在特定推理类型内的预测性较低,表明对齐和分歧并存。基于此,我们提出一个脑引导框架:我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征,在推理时进行干预,在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理,在10个LLM(1.5B-72B)上产生与仅语言监督正交的增益,具有跨推理类型的迁移,以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导,建立了一条由脑信号驱动的路径,通向更稳健和认知对齐的AI。

English abstract

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.