arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.16558 2026-06-16 cs.AI cs.RO cs.SY eess.SY New submission

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

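Affiliations Universität der Bundeswehr München; Hochschule für angewandte Wissenschaften Landshut

AI summary To address uncertainty at roundabouts in mixed traffic, the authors propose the ROSA-RL framework, which combines a Transformer that predicts conflict-zone occupancy probabilities with reinforcement learning to coordinate roundabout entry speeds safely and efficiently.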

Comments 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

Abstract

Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL, an uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL effectively handles uncertainty and outperforms a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available at github.com/urbanAIthi/ROSA-RL.
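
The state augmentation described in this abstract can be pictured with a minimal sketch. It assumes a five-second horizon sampled at 1 s steps and two hypothetical ego features (distance to the entry and current speed); the function name `augment_state` and the feature layout are illustrative assumptions, not taken from the ROSA-RL repository.

```python
def augment_state(ego_features, occupancy_probs):
    """Append per-step conflict-zone occupancy probabilities to the ego features."""
    probs = [float(p) for p in occupancy_probs]
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("occupancy probabilities must lie in [0, 1]")
    return [float(x) for x in ego_features] + probs

# Hypothetical input: 40 m to the entry at 8 m/s, five 1 s forecast steps.
state = augment_state([40.0, 8.0], [0.9, 0.7, 0.2, 0.1, 0.6])
```

An RL policy would then consume `state` as its observation, so the advisory can, for example, slow down when the first forecast steps are likely blocked and a gap is predicted later.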

2606.16545 2026-06-16 cs.CL New submission

Can LLM Coding Agents Reason About Time Series?

Filip Rechtorík, Ondřej Dušek, Zdeněk Kasner

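Affiliations Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University

AI summary Studies how well LLM coding agents analyze time series. Code access improves performance by up to 10%, yet the best agents still answer 22-34% of questions incorrectly, and the paper analyzes the remaining reasoning gaps.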

Comments 17 pages, 7 figures

Abstract

Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.
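
As an illustration of the coding-agent setup, the kind of query an agent might issue against a series is sketched below: an ordinary least-squares slope to answer a trend question. The function name and the sample series are hypothetical; the benchmarks and agent tooling used in the paper are not reproduced here.

```python
def linear_trend_slope(values):
    """Ordinary least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return cov / var

series = [1.0, 3.0, 5.0, 7.0]       # hypothetical data
slope = linear_trend_slope(series)  # positive slope suggests an upward trend
```

A model given only the raw numbers would have to approximate this by inspection, which is the back-of-the-envelope route the abstract mentions.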

2606.16542 2026-06-16 cs.RO New submission

ADAPT: Analytical Disturbance-Aware Policy Training for Humanoid Locomotion

Bofan Lyu, Jindou Jia, Kuangji Zuo, Yanshuo Lu, Shijia Han, Gen Li, Boyu Ma, Jingliang Li, Geng Li, Jianfei Yang

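Affiliations MARS Lab, Nanyang Technological University

AI summary Proposes the ADAPT framework, which uses an analytical whole-body disturbance observer to estimate external forces and torques online without force/torque sensors, improving the locomotion accuracy and robustness of humanoid robots under disturbances.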

Abstract

Humanoids deployed in human-centered environments must handle force-interactive tasks, where external contacts introduce unexpected disturbances that disrupt locomotion accuracy and stability. Existing learning-based approaches rely on broad domain randomization, task-specific force objectives, or learning-based force estimators from motion history, each of which compromises accuracy, task transferability, or out-of-distribution (OOD) robustness. We present Analytical Disturbance-Aware Policy Training (ADAPT), a framework that equips humanoid policies with a physically grounded disturbance observer. The core of ADAPT is an analytical whole-body disturbance observer that estimates residual force/torque online with the accessible robot dynamics, without requiring force/torque sensors. Fed directly into the policy, the estimated disturbances give the humanoid an explicit, physics-derived sense of external force/torque that can generalize across diverse unseen scenes. Experiments on a Unitree G1 humanoid show that ADAPT achieves accurate disturbance prediction and stronger robustness than a proprioception-only baseline under torso perturbations, standing pushes, and asymmetric hand payloads, with improved velocity tracking even on OOD disturbances. Moreover, ADAPT enables penalizing inferred disturbances at lower-body joints to encourage lighter locomotion.
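
For orientation, the idea of estimating disturbances from accessible dynamics rather than from sensors can be illustrated with a generic rigid-body residual. This is a textbook relation under assumed notation (mass matrix $M$, bias term $h$ collecting Coriolis and gravity effects, actuation selection matrix $S$, commanded joint torques $\tau$), not the specific observer derived in the ADAPT paper:

```latex
\begin{aligned}
M(q)\,\ddot{q} + h(q,\dot{q}) &= S^{\top}\tau + \tau_{\mathrm{ext}}, \\
\hat{\tau}_{\mathrm{ext}} &= M(q)\,\ddot{q} + h(q,\dot{q}) - S^{\top}\tau .
\end{aligned}
```

Practical observers typically avoid the noisy acceleration term, for example by filtering or by working with generalized momentum; those refinements are omitted here.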

2606.16541 2026-06-16 cs.AI cs.LG New submission

The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

Noor Islam S. Mohammad, Tamim Sheikh

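Affiliations Department of Computer Science, Informatics Institute, Istanbul Technical University, İstanbul, Türkiye; Department of Computer Science & Engineering, Jashore University of Science and Technology

AI summary Proposes Bidirectional Provability Fingerprinting, a framework that certifies the faithfulness of autoformalized statements by matching forward and backward consequence neighborhoods against natural-language probes. It adds four new components (counterfactual probe generation, the equivalence spectrum, adaptive probe budget allocation, and faithfulness-guided decoding), achieving a high drift-detection rate on the benchmark and reducing drift.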

Abstract

Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by faithfulness: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) Counterfactual Probe Generation (CPG), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the Equivalence Spectrum, a continuous faithfulness score that replaces brittle binary verdicts; (iii) Adaptive Probe Budget Allocation (APBA), an information-theoretic budget router; and (iv) Faithfulness-Guided Decoding (FGD), which uses BPF signals as a reward during autoformalization. We prove a drift detection theorem and a PAC-faithfulness result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probes under mild assumptions. We release DriftBench, a benchmark of 2,183 NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4. BPF + CPG detects 89.6% of drifted formalizations at a 3.0% false-positive rate, against 41.2% for typecheck and 63.3% for LLM-judge baselines, and FGD reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by 47%. https://pmlrbd.github.io/BPF/
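
To give a feel for how a bound of the form log(1/δ)/ε scales, the sketch below evaluates it numerically. The hidden constant of the big-O is unknown, so `constant=1.0` is purely an assumption for illustration, and `probe_budget` is a hypothetical helper rather than part of the released code.

```python
import math

def probe_budget(epsilon, delta, constant=1.0):
    """Number of probes suggested by a bound of the form c * log(1/delta) / epsilon."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(constant * math.log(1.0 / delta) / epsilon)

# Assumed targets: 5% error tolerance, 1% failure probability.
n_probes = probe_budget(epsilon=0.05, delta=0.01)
```

The dependence is linear in 1/ε but only logarithmic in 1/δ, so tightening the confidence level is much cheaper than tightening the error tolerance.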

2606.16535 2026-06-16 cs.LG cs.CV cs.SC New submission

Assessing Reliability of Symbol Detection in Concept Bottleneck Models

Javier Fumanal-Idocin, Javier Andreu-Perez

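Affiliations University of Essex

AI summary Studies the reliability of symbol detection in Concept Bottleneck Models (CBMs): independently trained concept detectors and classification heads are swapped to identify concepts prone to spurious activation, and a reliability-aware training strategy is proposed and validated on CUB-200-2011 and a synthetic task.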

Abstract

Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above 99%, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.
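
The two swap statistics quoted in the abstract can be computed as in the sketch below. The definitions are one plausible reading (drop as a difference in accuracy points, retention as a ratio), and the accuracies used are hypothetical numbers, not results from the paper.

```python
def swap_metrics(native_acc, swapped_acc):
    """Return (swap drop in accuracy points, relative retention in percent)."""
    if native_acc <= 0:
        raise ValueError("native accuracy must be positive")
    drop = native_acc - swapped_acc
    retention = 100.0 * swapped_acc / native_acc
    return drop, retention

# Hypothetical accuracies in percent: a detector paired with its own head vs. a swapped head.
drop, retention = swap_metrics(native_acc=80.0, swapped_acc=79.5)
```

A small drop with retention near 100% suggests the bottleneck carries the shared concepts; a large drop hints that detector and head co-adapted to shortcuts.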

2606.16532 2026-06-16 cs.SD cs.AI New submission

Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection

Zhuodong Liu, Hugen Lv, Xiangyu Li, Chunhong Yuan

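Affiliations Beijing Jiaotong University; Shanghai Jiao Tong University; ITMO University

AI summary To counter implicit identity leakage in audio deepfake detection, proposes a dual-granularity orthogonal disentanglement framework that enforces feature independence through sample-level cosine orthogonality and batch-level cross-covariance regularization, without auxiliary networks or adversarial training, and achieves lower equal error rates on several datasets.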

Comments Accepted at Interspeech 2026, 6 pages, 3 figures

Abstract

Audio deepfake detectors often fail to generalize across speakers, as they learn speaker-identity features rather than synthesis artifacts, known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework enforcing feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets demonstrate that the proposed method achieves 1.35%, 7.88%, and 21.58% equal error rates (EER), respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.
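
The two granularities can be sketched as two penalties over a pair of embedding branches. The sketch assumes one branch for spoof cues and one for speaker identity (the names `spoof` and `speaker` are illustrative), and uses squared cosine similarity and the squared Frobenius norm of the cross-covariance as stand-ins; the exact loss forms in the paper may differ.

```python
import math

def cosine_orthogonality(spoof, speaker):
    """Mean squared cosine similarity between paired embeddings (0 when orthogonal)."""
    total = 0.0
    for a, b in zip(spoof, speaker):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        total += (dot / norm) ** 2
    return total / len(spoof)

def cross_covariance_penalty(spoof, speaker):
    """Squared Frobenius norm of the batch cross-covariance between the two branches."""
    n = len(spoof)
    mean_a = [sum(col) / n for col in zip(*spoof)]
    mean_b = [sum(col) / n for col in zip(*speaker)]
    penalty = 0.0
    for i in range(len(mean_a)):
        for j in range(len(mean_b)):
            c = sum((spoof[k][i] - mean_a[i]) * (speaker[k][j] - mean_b[j])
                    for k in range(n)) / n
            penalty += c * c
    return penalty

# Toy batch: every pair is orthogonal, yet the branches co-vary across the batch.
spoof = [[1.0, 0.0], [2.0, 0.0]]
speaker = [[0.0, 1.0], [0.0, 3.0]]
```

On this toy batch the sample-level term is zero while the batch-level term is not, which shows why the two constraints are complementary rather than redundant.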

2606.16524 2026-06-16 cs.LG astro-ph.CO stat.ML New submission

Neural Bayesian Anomaly Mitigation: A Robust Loss that Doubles as an Unsupervised Contamination Classifier

S. A. K. Leeney, W. J. Handley, H. T. J. Bevins, E. de Lera Acedo

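Affiliations Astrophysics Group, Cavendish Laboratory, University of Cambridge; Institute of Astronomy, University of Cambridge

AI summary Proposes the Neural Bayesian Anomaly Mitigation (NBAM) loss, derived from a Bayesian latent-variable mixture model, which provides both a robust supervised loss and an unsupervised contamination posterior, and outperforms baselines such as Huber on CIFAR-10.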

Comments 13 pages, 4 figures

Abstract

Engineered robust losses such as Huber, Student-$t$, and generalised cross-entropy make supervised models tolerant of contamination but cannot answer which observations are corrupted. We introduce Neural Bayesian Anomaly Mitigation (NBAM), a general-purpose drop-in loss derived from a Bayesian latent-switch mixture model: the marginal likelihood defines a robust supervised loss, and the associated posterior defines an unsupervised contamination classifier. Like Huber or Student-$t$, NBAM can replace the standard training loss in any supervised pipeline; unlike them, it additionally learns a structured contamination model and returns a calibrated per-sample contamination posterior. A learned input-dependent prior $\pi_\phi(x)$ captures the spatial locality of contamination, so that samples near known corruptions are more likely to be flagged, while an Occam penalty emerges automatically and regularises against over-flagging. On CIFAR-10 with asymmetric label contamination, NBAM recovers the structure of the corruption process without supervision: the contamination posterior separates clean from corrupted samples, and the learned anomaly head identifies the direction of every label-flip pair. Alongside these capabilities, NBAM outperforms the four robust-loss baselines considered here at contamination rates 0.2-0.6.
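
The dual role of a latent-switch mixture (loss and classifier from the same model) can be sketched for a single sample. The sketch assumes a two-component mixture with a clean likelihood, an anomaly likelihood, and a prior contamination probability; the function name and the numbers are illustrative, and the learned prior and anomaly head of NBAM are not modelled.

```python
import math

def mixture_terms(lik_clean, lik_anom, prior_anom):
    """Return (negative log marginal likelihood, posterior contamination probability)."""
    marginal = (1.0 - prior_anom) * lik_clean + prior_anom * lik_anom
    loss = -math.log(marginal)
    posterior = prior_anom * lik_anom / marginal
    return loss, posterior

# Hypothetical sample that the clean model explains poorly.
loss, posterior = mixture_terms(lik_clean=0.01, lik_anom=0.1, prior_anom=0.2)
```

Training would minimise `loss`, while `posterior` is read off afterwards to flag likely corrupted samples; here it rises well above the prior because the anomaly component explains the sample better.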

2606.16523 2026-06-16 cs.CL New submission

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

Dingcheng Huang, Yuda Ding, Bingshuo Liu, Qingbin Liu, Xi Chen, Jiang Bian, Hongliang Sun, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

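Affiliations Harbin Institute of Technology; Tencent; Nanyang Technological University

AI summary Presents SkillWiki, a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills. It turns heterogeneous knowledge into reusable skill assets linked to their source evidence, covering the full lifecycle from knowledge ingestion through skill production, provenance-aware exploration, and governance to execution-driven evolution.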

Abstract

While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evolution of agent skills by transforming heterogeneous knowledge into reusable skill assets linked to their originating evidence. Our demonstration presents the complete skill lifecycle, from knowledge ingestion and skill production to provenance-aware exploration, governance, and execution-driven evolution. SkillWiki highlights a future in which knowledge, skills, and execution experience co-evolve within a shared infrastructure. The live demonstration and source code are publicly available at https://github.com/Huangdingcheng/SkillWiki.

2606.16519 2026-06-16 cs.CV New submission

BadWorld: Adversarial Attacks on World Models

Linghui Shen, Mingyue Cui, Xingyi Yang

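Affiliations The Hong Kong Polytechnic University

AI summary Proposes BadWorld, a label-free adversarial attack framework for autoregressive visual world models that combines a self-supervised velocity attack with trajectory-adaptive bi-level optimization and exposes their structural fragility.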

Comments Project Page: https://linghuiishen.github.io/BadWorld/

Abstract

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.

2606.16517 2026-06-16 cs.LG q-bio.QM New submission

How Post-Training Shapes Biological Reasoning Models

Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik

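Affiliations Harvard University; Google DeepMind; Google Research

AI summary Studies how each post-training stage (CPT, SFT, RL) affects the in-domain and out-of-domain performance of biological reasoning models. SFT improves in-domain performance but hurts generalization, RL partially recovers it, and the best strategy is brief SFT followed by a larger RL allocation.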

Abstract

Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.

2606.16513 2026-06-16 cs.RO New submission

Agile Fall Recovery for Quadrotors with Bidirectional Thrust via Reinforcement Learning

Anke Zhao, Yuhang Zhong, Kenghou Hoi, Junyu Mou, Junjie Wang, Lijie Wang, Jialiang Hou, Fei Gao

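Affiliations Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University; Differential Robotics

AI summary Proposes a reinforcement learning framework that recovers a quadrotor from arbitrary ground attitudes to stable hover using lightweight onboard sensors. An asymmetric actor-critic architecture and an Incremental Nonlinear Dynamic Inversion (INDI) controller address partial observability and sensor invalidity; simulations and real-world experiments demonstrate zero-shot transfer and robustness.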

Abstract

Autonomous fall recovery is a critical capability for quadrotors operating in real-world environments, where collisions or failures may leave the vehicle resting on the ground in an arbitrary attitude. This problem is challenging because recovery must be achieved under limited onboard sensing, in constrained free space, with ground contact, and in the presence of unknown disturbances. In this letter, we present an RL-based framework for autonomous fall recovery of a quadrotor from arbitrary ground attitudes to stable hover using only lightweight onboard sensors. To address severe partial observability and intermittent sensor invalidity, we train a recurrent policy within an asymmetric actor-critic architecture, leveraging an Incremental Nonlinear Dynamic Inversion (INDI) controller to track the policy output. Combined with high-fidelity simulations of motor response and optical flow, the overall training framework significantly reduces the sim-to-real gap. Simulation ablation studies validate the importance of the main design choices, while real-world experiments demonstrate zero-shot transfer and robust recovery under different initial attitudes, wind disturbances, and additional payloads. These results demonstrate that agile quadrotor fall recovery can be achieved without explicit state estimation using only limited and unreliable onboard sensing.

2606.16511 2026-06-16 cs.LG New submission

Tail-Shape Estimation in LLM Evaluation Is Fragile: A Protocol for Diagnosing False Positives

Luca Zhou

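Affiliations Sapienza University of Rome

AI summary Proposes a protocol for checking tail-shape estimates in LLM evaluation for false positives. It uses an extreme-value-theory tail index to separate tail heaviness from tail mass, and identifies three modes of false positives in toxicity evaluation.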

Comments 9 pages of main paper, 4 figures and 4 tables in the main paper, more in the appendix

Abstract

Recent work motivates moving large language model (LLM) evaluation from mean-based to tail-aware metrics, including conditional value-at-risk and tail-index estimates of reward-model error. We ask whether the canonical extreme-value-theory tail-index parameter, which isolates how heavy a tail is from how large the tail mass is, adds discriminative information beyond the mean and a standard tail-magnitude statistic in LLM evaluation. We pre-register a protocol covering admissibility, goodness-of-fit, threshold-stability, and effect-size requirements for any positive tail-shape claim. The protocol is the contribution of this paper; the empirical study below is a demonstration of what its gates catch. Applied to a standard LLM toxicity-evaluation setup under two structurally different scorer families, the protocol catches three distinct modes of false positives that a naive analysis would have published, and rejects the headline tail-shape claim on both scorers. We conclude that tail-shape estimation in the LLM toxicity-evaluation setups we examined is more fragile than the recent literature suggests, and recommend the protocol as a starting point for tail-index claims in similar setups.
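
The distinction between tail shape and tail magnitude can be made concrete with two standard statistics. The sketch below uses the Hill estimator as one common tail-index estimator and an empirical conditional value-at-risk as the magnitude statistic; the abstract does not say which estimators the protocol uses, so both choices, the function names, and the sample data are assumptions for illustration.

```python
import math

def hill_tail_index(samples, k):
    """Hill estimate of the extreme-value index from the k largest positive samples."""
    xs = sorted(samples, reverse=True)
    if not 0 < k < len(xs) or xs[k] <= 0:
        raise ValueError("need 0 < k < n and a positive threshold")
    # Average log-excess of the top k samples over the (k+1)-th largest sample.
    return sum(math.log(x / xs[k]) for x in xs[:k]) / k

def cvar(samples, alpha):
    """Mean of the worst (1 - alpha) fraction of samples (larger = worse)."""
    xs = sorted(samples, reverse=True)
    m = max(1, math.ceil((1.0 - alpha) * len(xs)))
    return sum(xs[:m]) / m

scores = [1.0, 2.0, 4.0, 8.0]  # hypothetical per-prompt scores
xi_hat = hill_tail_index(scores, k=2)
```

The Hill estimate depends strongly on the choice of `k`, which is the kind of threshold sensitivity that a stability requirement in such a protocol would be meant to catch.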

2606.16509 2026-06-16 cs.AI New submission

Model Graph Inductive Learning for Knowledge Graph Completion

Mohommad Esmaei Khani, Mahdieh Hasheminejad, Ali Taherkhani, Hossein Hajiabolhassan

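Affiliations Yazd University; Institute for Advanced Studies in Basic Sciences (IASBS); Medizinische Universität Graz

AI summary Proposes the Model Graph Inductive Learning (MGIL) framework, which clusters entities to build a model graph and applies a GNN to capture global structure, producing high-quality initial embeddings and achieving state-of-the-art or competitive results on inductive link prediction.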

Abstract

Link prediction in knowledge graphs fundamentally depends on the quality of learned embeddings for entities and relations. However, most existing methods derive these embeddings by aggregating only the local neighborhood of each entity, neglecting the global structure of the knowledge graph. This limited view prevents models from capturing higher-level structural patterns that are essential for accurate and generalizable link prediction. To address these limitations, we introduce Model Graph Inductive Learning (MGIL), a framework that constructs a model graph by clustering entities based on the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture the global view of the knowledge graph. These embeddings subsequently serve as high-quality initial features for the original knowledge graph, replacing random initialization and leading to more stable and expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks demonstrate that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, highlighting its effectiveness across diverse graph settings.
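
The model-graph construction can be sketched in a simplified form. The sketch assumes that entities are clustered by exact match of their (incoming, outgoing) relation sets, which is a stand-in for the similarity-based clustering described in the abstract; the function name and the toy triples are hypothetical, and the GNN step is omitted.

```python
from collections import defaultdict

def build_model_graph(triples):
    """Cluster entities by (incoming, outgoing) relation sets and lift edges to clusters."""
    incoming, outgoing = defaultdict(set), defaultdict(set)
    for head, rel, tail in triples:
        outgoing[head].add(rel)
        incoming[tail].add(rel)
    entities = set(incoming) | set(outgoing)
    signature = {e: (frozenset(incoming[e]), frozenset(outgoing[e])) for e in entities}
    cluster_ids = {}
    for e in sorted(entities):  # sorted for deterministic cluster numbering
        cluster_ids.setdefault(signature[e], len(cluster_ids))
    cluster_of = {e: cluster_ids[signature[e]] for e in entities}
    # Each original triple becomes an edge between the clusters of its endpoints.
    edges = {(cluster_of[h], r, cluster_of[t]) for h, r, t in triples}
    return cluster_of, edges

triples = [("alice", "works_at", "acme"), ("bob", "works_at", "acme"),
           ("acme", "located_in", "berlin")]
cluster_of, model_edges = build_model_graph(triples)
```

Here `alice` and `bob` collapse into one cluster because they play the same relational role, so the model graph is smaller than the original graph while preserving its relational skeleton.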

2606.16505 2026-06-16 cs.SD cs.LG New submission

Semi-Supervised Speech Confidence Detection using Pseudo-Labelling and Whisper Embeddings

Adam Wynn, Jingyun Wang, Xiangyu Tan

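Affiliations Durham University; Shanghai Open University

AI summary Proposes a framework that combines hand-engineered features with Whisper embeddings, expands the training data through pseudo-labelling, and fuses the features with a co-attention mechanism, reaching 75% accuracy in speech confidence detection.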

Comments 8 pages, 3 figures. Published in the Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025). Shorter, preliminary version of arXiv:2605.12387

Journal ref AIED 2025. LNCS vol 15882. Springer, Cham (2025)

Abstract

Understanding speaker confidence is crucial in educational settings, as it can enhance personalised feedback and improve learning outcomes. This study introduces a novel framework for detecting speaker confidence by integrating human-engineered features with embeddings from the Whisper encoder. To address data limitations, a pseudo-labelling technique is employed to expand the labelled dataset, allowing the model to learn from both human-annotated and model-generated labels. The framework combines traditional speech features including pitch, volume, rate of speech, and the presence of disfluencies and stress, with Whisper embeddings, and uses a co-attention mechanism to fuse these representations and achieve an overall accuracy of 75%. This study contributes to advancing speech analysis, enabling applications that support personalised learning and speaking skill development.
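
Pseudo-labelling is commonly implemented by keeping only confident model predictions on unlabelled data. The sketch below shows that generic selection step; the 0.9 threshold, the function name, and the probabilities are assumptions, since the abstract does not state how the paper selects pseudo-labels.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Keep (index, label) pairs whose top class probability reaches the threshold."""
    selected = []
    for idx, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((idx, label))
    return selected

# Hypothetical class probabilities for three unlabelled utterances.
unlabelled_probs = [[0.95, 0.05], [0.55, 0.45], [0.08, 0.92]]
new_labels = pseudo_label(unlabelled_probs)
```

The selected pairs would be added to the human-annotated set for the next training round, while the uncertain middle utterance stays unlabelled.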

2606.16504 2026-06-16 cs.RO New submission

APEX: Adaptive Policy Execution for Precise Manipulation

Mengfei Zhao, Chenxi Jiang, Tuo An, Jindou Jia, Jianfei Yang

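Affiliations MARS Lab, Nanyang Technological University

AI summary To close the execution gap between policy and controller, proposes APEX, a plug-and-play framework that reduces tracking error and raises manipulation success rates through dynamically feasible reference reconstruction and test-time adaptation.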

Comments 20 pages, 9 figures, 4 tables

Abstract

Modern imitation learning methods, including visuomotor and Vision-Language-Action (VLA) policies, typically output high-level action references that are executed by low-level controllers. However, the absence of higher-order reference signals, together with the policy's lack of awareness of the underlying low-level control dynamics during training, inevitably induces an execution gap. As a result, realized actions deviate systematically from policy-commanded ones, with a critical impact on precision-sensitive manipulation. Prior work either modifies the policy architecture or the low-level controller, both requiring intrusive changes to the pretrained policy or packaged controller. This raises a natural question: when the policy and controller are both treated as inaccessible black boxes, can we bridge the execution gap? We propose Adaptive Policy Execution (APEX), a plug-and-play framework inserted between the policy and the controller that reconstructs a dynamically feasible reference from policy outputs and adapts at test-time according to low-level state feedback, with a provable convergence guarantee. Extensive empirical studies show that APEX reduces controller-induced tracking error by 41.2% on demonstration replay and improves manipulation success by 4.8-25.8 percentage points across four visuomotor and VLA policy classes.

2606.16502 2026-06-16 cs.CV New submission

Active Reference Acquisition in Few-Shot Font Generation

少样本字体生成中的主动参考获取

Shinnosuke Matsuo

发表机构 * NTT, Inc., Japan(日本电报电话公司) Kyushu University(九州大学)

AI总结 针对少样本字体生成中参考不足导致风格不匹配的问题,提出主动参考获取框架,通过基于局部结构部分覆盖的获取函数,顺序选择最需补充的字符,提升生成质量并减少查询次数。

Comments Accepted at ICDAR2026

详情
AI中文摘要

少样本字体生成旨在给定一个或几个参考字形的情况下,合成字体的其余字形,同时保持风格一致性,从而支持字体设计师高效完成字体设计。现有方法主要关注在固定参考集下提高生成质量。然而,当当前参考字形不足以代表目标风格时,少样本字体生成可能无法产生令人满意的结果。在实际场景中,必要时可以从设计师处获取额外的参考字形。因此,我们提出一个新的框架——少样本字体生成中的主动参考获取,其中模型顺序决定下一个获取哪个字符作为额外参考。此外,我们提出一种基于参考部分覆盖的获取函数来高效地查询设计师。受字体风格由局部结构部分良好表征的观察启发,我们使用局部特征直方图表示每个字形,并选择最大化参考集预期部分覆盖的查询字符。通过优先选择包含当前参考未覆盖部分的字符,所提方法逐步扩展参考集中视觉部分的多样性。结果,生成质量得到提高,且查询次数更少。在Google Fonts数据集上的实验表明,所提方法实现了比随机查询和与参考无关的基线更高的生成质量。代码可在https://github.com/matsuo-shinnosuke/ActiveRef-FontGen获取。

英文摘要

Few-shot font generation aims to synthesize the remaining glyphs of a font given one or a few reference glyphs while preserving stylistic consistency, thereby supporting font designers in efficiently completing a typeface. Existing methods primarily focus on improving generation quality given a fixed reference set. However, when the current reference glyphs are insufficient to represent the target style, few-shot font generation may fail to produce satisfactory results. In practical scenarios, additional reference glyphs can often be obtained from the designer when necessary. Accordingly, we propose a new framework, Active Reference Acquisition in Few-Shot Font Generation, in which the model sequentially decides which character to acquire next as an additional reference. Furthermore, we propose a reference part-coverage-based acquisition function to efficiently query the designer. Motivated by the observation that font styles are well characterized by local structural parts, we represent each glyph using a histogram of local features and select query characters that maximize the expected part coverage of the reference set. By prioritizing characters that contain parts not yet covered by the current references, the proposed method progressively expands the diversity of visual parts in the reference set. As a result, generation quality is improved with fewer queries. Experiments on the Google Fonts dataset demonstrate that the proposed method achieves higher generation quality than random querying and reference-agnostic baselines. The code is available at https://github.com/matsuo-shinnosuke/ActiveRef-FontGen.
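The part-coverage idea in the abstract above can be illustrated with a small greedy selector. This is only a sketch under stated assumptions, not the authors' acquisition function: it assumes a part histogram is already available for every candidate character (the paper speaks of *expected* coverage, which this toy version does not model), and the names `covered_parts`, `select_next_query`, and the four-bin histograms are hypothetical.

```python
from typing import Dict, List, Set


def covered_parts(hist: List[int]) -> Set[int]:
    """Indices of local-part bins that occur at least once in a glyph."""
    return {i for i, count in enumerate(hist) if count > 0}


def select_next_query(
    references: List[str],
    candidates: List[str],
    part_hists: Dict[str, List[int]],
) -> str:
    """Pick the candidate character that adds the most uncovered parts."""
    covered: Set[int] = set()
    for ch in references:
        covered |= covered_parts(part_hists[ch])
    # Coverage gain = number of part bins the candidate has
    # that no current reference glyph has yet.
    return max(
        candidates,
        key=lambda ch: len(covered_parts(part_hists[ch]) - covered),
    )


# Toy histograms over four hypothetical part bins.
hists = {
    "A": [1, 1, 0, 0],
    "B": [1, 0, 0, 0],
    "C": [0, 0, 1, 1],
    "D": [0, 1, 1, 0],
}
next_char = select_next_query(["A"], ["B", "C", "D"], hists)
```

With "A" as the only reference, "C" is chosen because it contributes two part bins that "A" lacks, whereas "B" adds none and "D" adds one.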

2606.16497 2026-06-16 cs.LG cs.AI cs.CL 新提交

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

daVinci-kernel:通过强化学习协同进化技能选择、总结与利用的GPU内核优化

Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出daVinci-kernel框架,通过强化学习联合训练技能选择、策略生成和技能总结三个智能体,共享LLM骨干,实现GPU内核优化,在KernelBench上超越先前最优模型。

详情
AI中文摘要

GPU内核优化代表了一种范式,其中功能正确性被假定,执行效率是目标。我们提出daVinci-kernel,一个强化学习框架,通过动态演化的技能库将技能发现与技能利用相结合。daVinci-kernel联合训练三个共享一个LLM骨干的智能体:技能选择智能体通过BM25和LLM重排序检索相关技术,策略智能体基于所选技能生成多轮CUDA/Triton内核,技能总结智能体将成功轨迹提炼为可复用技能。候选技能仅在基于执行的验证确认可复现加速后才被添加。所有三个智能体共享单个LLM骨干,通过多样性过滤数据上的结构化SFT冷启动初始化,然后通过多轮REINFORCE和每个智能体的优势估计进行端到端联合优化。在KernelBench上,daVinci-kernel-14B在Fast$_1$阈值下,Level 1、Level 2和Level 3分别达到37.2%、70.6%和32.2%,优于先前最强的RL训练模型Dr.Kernel-14B。

英文摘要

GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, Dr.Kernel-14B.

2606.16494 2026-06-16 cs.CL cs.AI cs.CV 新提交

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

迷失在末尾:多模态检索增强问答中的首因偏差

Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) The Ohio State University(俄亥俄州立大学)

AI总结 研究多模态知识型视觉问答中检索上下文的位置依赖,发现不同于纯文本的U形效应,出现首因偏差(开头优于末尾),并通过消融实验定位原因为指令调优阅读器的提示槽0。

Comments 15 pages, 9 figures. Under review at EMNLP 2026

详情
AI中文摘要

基于知识的视觉问答(KB-VQA)通过将阅读器条件化于从维基百科规模知识库检索的段落,使视觉-语言系统能够回答超出其参数知识的问题。在纯文本长上下文LLM中,检索上下文的使用遵循Liu等人(2024)的U形“迷失在中间”效应:上下文开头和结尾的信息被使用,中间部分被忽略。这种效应是否会迁移到部署的多模态KB-VQA中尚不清楚。为填补这一空白,我们设计了首个针对多模态KB-VQA中阅读器侧位置依赖的受控探针:一种黄金位置协议,其中只有黄金段落的提示槽在问题内变化。我们在三个开源7B/8B VLM阅读器和两个KB-VQA基准上运行,k最大为20。形状从U形翻转为首因:在每个阅读器-基准组合上,黄金在开头比黄金在结尾高出16到26个点,我们称这种效应为“迷失在末尾”。三项针对性消融实验缩小了原因:纯文本对照显示多模态设置将已存在的文本模式首因放大了2.2到4.5倍,图像位置和干扰物洗牌消融共同将根源定位到指令调优阅读器的提示槽0。在冻结的阅读器上,三种检索侧修复(MMR、oracle重排序、基于排名的重排序)均未缩小差距(无可区分的改进)。我们的发现表明,recall@k是部署KB-VQA的错误指标,缩小差距需要阅读器侧干预;我们发布该协议作为评估此类干预的受控工具。

英文摘要

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
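A gold-position protocol of the kind described above can be sketched as follows: the distractor order stays fixed and only the slot of the gold passage changes. This is a minimal illustration, not the released protocol; `build_context` and `position_sweep` are hypothetical names, and how the resulting passages are formatted into the reader's prompt is left out.

```python
from typing import List


def build_context(gold: str, distractors: List[str], gold_slot: int) -> List[str]:
    """Insert the gold passage at `gold_slot`, keeping distractor order fixed."""
    if not 0 <= gold_slot <= len(distractors):
        raise ValueError("gold_slot must lie between 0 and len(distractors)")
    return distractors[:gold_slot] + [gold] + distractors[gold_slot:]


def position_sweep(gold: str, distractors: List[str]) -> List[List[str]]:
    """One context per possible gold slot, so only the gold position varies."""
    return [
        build_context(gold, distractors, slot)
        for slot in range(len(distractors) + 1)
    ]


contexts = position_sweep("GOLD", ["d1", "d2", "d3"])
gold_first, gold_last = contexts[0], contexts[-1]
```

Comparing reader accuracy on `gold_first` against `gold_last` for the same question is the contrast behind the reported 16 to 26 point gap.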

2606.16491 2026-06-16 cs.RO 新提交

HATS: A Human-Agent Teleoperation System for Multi-Arm Data Collection

HATS:用于多臂数据收集的人-智能体遥操作系统

Zesen Lin, Jian-Jian Jiang, Haoming Cen, Xiao-Ming Wu, Dandan Zhang, Wei-Shi Zheng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) Nanyang Technological University(南洋理工大学) Imperial College London(帝国理工学院)

AI总结 提出HATS系统,由单操作员借助MLLM智能体控制两主臂和两辅助臂,实现高效多臂数据收集,性能媲美双人专家团队。

详情
AI中文摘要

许多真实世界的操作场景,例如处理复杂的协作任务和应对大工作空间,需要协调两个以上的机械臂。因此,需要一个有效的多臂遥操作系统来收集训练协调多臂操作策略的示范数据。然而,现有的遥操作框架主要关注单操作员或多操作员设置,面临着单操作员认知负荷与多操作员协调成本之间的实际权衡。为了解决这个问题,我们引入了HATS,一个人类-智能体遥操作系统,使单个人类操作员在基于MLLM的智能体辅助下,能够收集多臂操作任务的数据。我们的系统解耦了控制空间:两个主臂由人类直接遥操作,而两个辅助臂由处理子任务的无训练智能体控制。此外,人类操作员可以在执行过程中使用语音命令来防止碰撞并纠正辅助臂的行为。大量评估表明,HATS在数据收集效率和成功率上与专家双人团队相当。此外,下游策略评估证明了通过HATS收集的数据的有效性和质量。

英文摘要

Many real-world manipulation scenarios, such as handling complex collaborative tasks and dealing with large workspaces, require coordination of more than two robotic arms. Consequently, an effective multi-arm teleoperation system is required to collect demonstrations for training coordinated multi-arm manipulation policies. However, existing teleoperation frameworks mainly focus on single-operator or multi-operator setups, facing a practical trade-off between the cognitive load placed on a single operator and the coordination cost incurred by multiple operators. To address this problem, we introduce HATS, a human-agent teleoperation system that enables a single human operator, assisted by an MLLM-based agent, to collect data for multi-arm manipulation tasks. Our system decouples the control space: two primary arms are directly teleoperated by the human, while two assistive arms are controlled by a training-free agent that handles sub-tasks. In addition, the human operator can use voice commands to prevent collisions and correct assistive arm behaviors during execution. Extensive evaluations demonstrate that HATS achieves data collection efficiency and success rates comparable to expert dual-human teams. Moreover, downstream policy evaluations demonstrate the efficacy and quality of the data collected through HATS.

2606.16490 2026-06-16 cs.RO 新提交

Robots that Collaborate: Sequential Asymmetric Imitation for Learning Coupled Robot Policies

协作机器人:用于学习耦合机器人策略的序列非对称模仿

Yincong Chen, Ranpeng Qiu, Zihao Li, Yanan Zhou, Guoqiang Ren, Weiming Zhi

发表机构 * Zeno AI University of Sydney(悉尼大学)

AI总结 提出序列非对称模仿(SAI),通过单操作员课程学习耦合多机器人行为,无需同步双操作员演示或显式通信,在真实双机器人操作任务中提升成功率与相位同步。

详情
AI中文摘要

协作移动操作要求机器人与部分可观测的伙伴协调,同时通过共享物体进行物理交互。这很困难,因为失败通常不是由于局部技能差,而是由于不合时宜的等待、让步、拉动、释放或重新定位。我们通过两个双臂移动操作器与刚性和可变形物体耦合来研究这个问题。我们提出序列非对称模仿(SAI),一种单操作员课程,用于学习耦合的多机器人行为,无需同步双操作员演示或显式机器人间通信。SAI 首先从与顺从人类伙伴的单侧演示中训练机器人 A,然后针对已部署的机器人 A 策略训练机器人 B,最后在协调失败附近使用稀疏干预来优化机器人 A。这种分阶段过程使策略暴露于越来越真实的伙伴行为,包括延迟、相位不匹配、让步不足和交互冲突。在真实世界的双机器人操作任务中,SAI 在任务成功率、相位同步和伙伴条件性让步方面优于独立模仿和课程消融基线。这些结果表明,物理耦合协作可以通过模仿课程的结构来学习,而不是通过同步多操作员演示或显式协调机制。项目页面:http://cyc0429.github.io/sai-project-page/

英文摘要

Collaborative mobile manipulation requires robots to coordinate with a partially observed partner while physically interacting through shared objects. This is difficult because failures often arise not from poor local skills, but from mistimed waiting, yielding, pulling, releasing, or repositioning. We study this problem with two bimanual mobile manipulators coupled through rigid and deformable objects. We propose Sequential Asymmetric Imitation (SAI), a single-teleoperator curriculum for learning coupled multi-robot behaviors without synchronized dual-operator demonstrations or explicit inter-robot communication. SAI trains Robot A from unilateral demonstrations with a compliant human partner, trains Robot B against the deployed Robot A policy, and then refines Robot A using sparse interventions near coordination failures. This staged process exposes the policies to increasingly realistic partner behaviors, including delay, phase mismatch, insufficient yielding, and interaction conflict. Across real-world dual-robot manipulation tasks, SAI improves task success, phase synchronization, and partner-contingent yielding over independent imitation and curriculum-ablation baselines. These results suggest that physically coupled collaboration can be learned through the structure of the imitation curriculum, rather than through synchronized multi-operator demonstrations or explicit coordination mechanisms. Project page: http://cyc0429.github.io/sai-project-page/

2606.16489 2026-06-16 cs.LG 新提交

BRICKS-WM: Building Reusability via Interface Composition Kinetics for Structured World Models

BRICKS-WM:通过接口组合动力学构建结构化世界模型的可重用性

Shaowei Zhang, Jiahan Cao, Xunlan Zhou, Shenghua Wan, De-Chuan Zhan

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) School of Intelligence Science and Technology, Nanjing University, China(南京大学智能科学与技术学院)

AI总结 提出BRICKS-WM框架,将全局动力学分解为通过潜在接口交互的独立模块(如智能体和背景),实现冻结背景模块跨智能体重用,避免从头训练。

详情
AI中文摘要

基于模型强化学习(MBRL)通过利用潜在世界模型在连续控制中取得了显著成功。然而,现有方法通常依赖单一潜在动力学,将环境动力学纠缠为耦合过程。这种耦合严重限制了可重用性:即使环境保持不变,改变智能体也需要从头重新训练整个世界模型。为了解决这个问题,我们引入了BRICKS-WM(通过接口组合动力学构建结构化世界模型的可重用性),一个用于模块化组装结构化世界模型的框架。基于物理世界由独立实体组成的洞察,我们假设全局动力学可以建模为通过潜在接口交互的不同动力学模块的组合。作为一个最小实例,我们将潜在状态空间分解为一个被驱动的智能体模块和一个外部背景模块,通过学习的潜在接口连接。与先前优先考虑视觉分割的以对象为中心的方法不同,BRICKS-WM在转移动力学中强制执行功能分离,确保背景动力学对智能体动力学保持不可知。实验表明,BRICKS-WM在从头训练时实现了与强单一基线相当的控制性能,并能够跨智能体重用冻结的背景动力学。

英文摘要

Model-based Reinforcement Learning (MBRL) has achieved remarkable success in continuous control by leveraging latent world models. However, prevailing approaches typically rely on monolithic latent dynamics, entangling environment dynamics into a coupled process. This coupling severely limits reusability: altering the agent necessitates retraining the entire world from scratch, even if the environment remains constant. To address this, we introduce BRICKS-WM (Building Reusability via Interface Composition Kinetics for Structured World Models), a framework for the modular assembly of structured world models. Driven by the insight that the physical world is composed of independent entities, we posit that global dynamics can be modeled as a composition of distinct dynamical modules interacting via latent interfaces. As a minimal instantiation, we factorize the latent state space into an actuated Agent module and an external Background module, bridged by a learned latent interface. Unlike prior object-centric methods that prioritize visual segmentation, BRICKS-WM enforces a functional separation in transition dynamics, ensuring that background dynamics remains agnostic to the agent's dynamics. Empirically, BRICKS-WM achieves control performance comparable to strong monolithic baselines when trained from scratch, and enables the reuse of frozen background dynamics across agents.

2606.16484 2026-06-16 cs.CV cs.AI cs.MM 新提交

Unified Multimodal Model for Brain MRI Imputation and Understanding

统一多模态模型用于脑MRI补全与理解

Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

发表机构 * Department of Computing, Imperial College London(伦敦帝国理工学院计算机系) Department of Brain Sciences, Imperial College London(伦敦帝国理工学院脑科学系)

AI总结 提出UniBrain模型,通过统一训练策略联合处理脑MRI模态补全与图像理解,采用自对齐和动态隐藏状态机制,在多疾病数据集上实现高性能。

Comments Early accepted to MICCAI 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在医学领域具有巨大潜力,因为它们继承了LLM的知识,并允许以自然语言集成、分析和解释多种数据模态。然而,医学MLLMs面临重大挑战,特别是高质量训练数据的稀缺以及现实临床环境中数据缺失的频繁发生。在此,我们提出了一种新颖的统一多模态模型UniBrain,用于脑磁共振图像(MRI)分析。为了解决潜在的脑MRI模态缺失问题,我们采用统一训练策略进行联合成像模态补全和脑图像理解。在训练过程中,构建了交错且描述丰富的数据流,以自回归方式训练模型,从而实现基于生成的多模态数据的医学推理。引入自对齐策略,利用密集图像嵌入学习细粒度解剖特征,无需详细的图像描述。此外,我们提出了一种动态隐藏状态机制,以缓解长上下文多模态推理中的暴露偏差。在多疾病脑MRI数据集上的大量实验表明,UniBrain在模态不完全的各种情况下,在脑图像补全、理解和疾病诊断方面均取得了高性能。

英文摘要

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLM and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

2606.16481 2026-06-16 cs.AI 新提交

Steering Emotional Dynamics for Art Therapy: Controllable Narrative Script Generation through Hierarchically Guided LLM Agents

引导艺术治疗的情感动态:通过分层引导的LLM智能体实现可控叙事脚本生成

Suqing Wang, Qinghai Miao, Chao Guo, Yisheng Lv

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出EC-Script框架,通过分层控制情感轨迹生成叙事脚本,实现情感轨迹规划、场景驱动和局部情感调节,显著优于基线方法。

详情
AI中文摘要

艺术治疗在情感治愈中扮演重要角色,其中叙事创作是情感表达的主要载体。鉴于治愈过程中情感固有的动态特性,具有精细控制情感波动的叙事使个体能够安全地投射内心冲突并实现情感宣泄。近年来,随着大型语言模型(LLM)的快速发展,自动叙事生成技术为支持此类艺术设计提供了新途径。然而,现有方法虽然能生成流畅文本,但难以生成遵循特定情感轨迹的叙事,无法满足以情感为导向的心理治愈需求。为解决这些问题,本文提出EC-Script,一种基于LLM智能体的框架,能够实现对情感治愈叙事生成中情感轨迹的分层控制。为确保生成的叙事严格遵循给定的情感模式,EC-Script通过情感轨迹规划建立整体叙事方向,通过角色驱动场景生成推动场景级情节发展,并通过情感控制脚本编写调节角色的局部情感变化。最终输出逐场景的脚本内容,与预设情感轨迹保持高度一致。实验结果表明,EC-Script在情感轨迹遵循度上显著优于基线方法,展现出优秀且可靠的情感可控性,从而为AI辅助情感治愈场景提供有效的技术支持。

英文摘要

Art therapy plays a vital role in emotional healing, in which narrative creation acts as the primary vehicle for emotional expression. Given the inherently dynamic nature of emotions during healing, narratives with finely controlled emotional fluctuations enable individuals to safely project inner conflicts and achieve emotional catharsis. Recently, with the rapid development of Large Language Models (LLMs), automated narrative generation technology has provided a new pathway to support such artistic designs. However, while existing methods can produce fluent texts, they struggle to generate narratives that adhere to specified affective trajectories, failing to meet the demands of emotion-oriented psychological healing. To address these issues, this paper proposes EC-Script, an LLM agent-based framework that enables hierarchical control of the affective trajectory in narrative generation for emotional healing. To ensure that the generated narratives strictly follow the given emotional patterns, EC-Script establishes overall narrative direction through Emotion-Trajectory Planning, propels scene-level plot development with Character-Driven Scene Generation, and regulates local emotional changes of characters via Emotion-Controlled Script Writing. Ultimately, it outputs scene-by-scene script content that remains highly consistent with the preset affective trajectory. Experimental results demonstrate that EC-Script significantly outperforms baseline methods in affective trajectory adherence, exhibiting excellent and reliable emotional controllability, thereby providing effective technical support for AI-assisted emotional healing scenarios.

2606.16480 2026-06-16 cs.RO cs.AI cs.SY eess.SY 新提交

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

HOLO-MPPI:通过分层策略优化的多场景运动规划

Youngjae Min, Jovin D'sa, Faizan M. Tariq, David Isele, Navid Azizan, Sangjae Bae

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Honda Research Institute, USA(本田研究所(美国))

AI总结 提出HOLO-MPPI框架,结合离线高层策略学习与在线低层随机最优控制,实现多场景运动规划,无需针对每个场景重新调整参数,在自动驾驶中优于MPPI和端到端RL基线。

详情
AI中文摘要

部署在现实世界中的机器人必须在不同场景下规划运动,而无需针对每个场景重新调整参数。端到端强化学习(RL)可以跨场景泛化,但在分布偏移、奖励错误指定和随机交互下往往变得脆弱。模型预测路径积分(MPPI)控制能够在无梯度的情况下实现强大的实时优化,但其性能依赖于良好形状的采样先验,而手动设计先验无法扩展到多场景部署。我们提出了HOLO-MPPI(高层离线,低层在线MPPI),一种多场景运动规划框架,结合了高层策略学习与低层随机最优控制。离线时,我们学习一个高层策略,在抽象动作空间中提出场景鲁棒的规划,并利用学习的世界模型进行在线推演。在线时,该策略作为数据驱动的先验生成器,根据当前观测和目标参数化MPPI的采样分布。然后MPPI围绕该先验实时优化低层控制序列,以适应局部扰动。我们通过设计有效的高层动作空间和定制模型架构,在自动驾驶中实例化HOLO-MPPI。在多种驾驶场景下的评估表明,HOLO-MPPI在保持实时控制的同时,优于MPPI和端到端RL基线。

英文摘要

Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
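The MPPI step that the abstract builds on can be sketched generically: sample control sequences around a prior mean, score them, and average them with exponentially decaying weights. This is the textbook importance-weighted update, not the HOLO-MPPI implementation; in the paper the prior comes from a learned high-level policy, whereas here `prior` is simply a fixed list, and `sample_around_prior`, `mppi_update`, `sigma`, and `temperature` are hypothetical names.

```python
import math
import random
from typing import List


def sample_around_prior(
    prior_mean: List[float], sigma: float, num_samples: int, rng: random.Random
) -> List[List[float]]:
    """Draw control sequences as Gaussian perturbations of the prior mean."""
    return [
        [m + rng.gauss(0.0, sigma) for m in prior_mean]
        for _ in range(num_samples)
    ]


def mppi_update(
    samples: List[List[float]], costs: List[float], temperature: float
) -> List[float]:
    """Cost-weighted average of sampled sequences (softmin weights)."""
    min_cost = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    horizon = len(samples[0])
    return [
        sum(w * s[t] for w, s in zip(weights, samples)) / total
        for t in range(horizon)
    ]


def tracking_cost(controls: List[float], target: float = 1.0) -> float:
    """Toy cost: squared distance of every control from a target value."""
    return sum((u - target) ** 2 for u in controls)


rng = random.Random(0)
prior = [0.0] * 5  # stand-in for a prior proposed by a high-level policy
samples = sample_around_prior(prior, sigma=0.5, num_samples=256, rng=rng)
costs = [tracking_cost(s) for s in samples]
refined = mppi_update(samples, costs, temperature=1.0)
```

The quality of `refined` depends on how well the samples cover low-cost regions, which is why a well-shaped prior matters for this family of methods.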

2606.16479 2026-06-16 cs.CV cs.AI 新提交

Uncertainty Quality of VGGT: An Analysis on the DTU Benchmark Dataset

VGGT的不确定性质量:基于DTU基准数据集的分析

Markus Hillemann, Robert Langendörfer, Steven Landgraf, Markus Ulrich

发表机构 * Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院摄影测量与遥感研究所)

AI总结 本文分析VGGT模型在DTU数据集上的不确定性预测质量,确定有效置信度阈值,并证明提升不确定性质量可显著改善3D重建精度。

Comments Accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

详情
AI中文摘要

视觉几何基础变换器(VGGT)在短时间内引起了广泛关注,尤其是因其在CVPR-2025上获得最佳论文奖。与DUSt3R和MASt3R类似,VGGT旨在通过用一个简单、统一的前馈神经网络取代束调整和特征匹配等既定方法,实现范式转变,该网络可直接从场景的多张图像中在几秒内预测相机位姿、深度图和密集3D结构。其关键能力是在单次前向传播中一致地处理任意数量的视图,无需任何后处理或迭代优化。对于摄影测量学,这为实时、可扩展和可访问的3D重建开辟了新的可能性。在此背景下,不仅高重建精度至关重要,高质量的不确定性估计也至关重要,因为它们能增强信任并实现稳健的质量保证。因此,本文研究了VGGT不确定性预测的质量。分析确定了用于过滤VGGT原始输出的有效置信度阈值,并证明提升不确定性质量在提高其3D重建精度方面具有巨大潜力。

英文摘要

Visual Geometry Grounded Transformer (VGGT) has already attracted a great deal of attention in a short period of time, not least due to the Best Paper Award at CVPR-2025. Similar to DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by replacing established methods like bundle adjustment and feature matching with a simple, unified, feed-forward neural network that predicts camera poses, depth maps, and dense 3D structure directly from multiple images of a scene in a few seconds. A key aspect is its ability to process an arbitrary number of views consistently in a single forward pass without any post-processing or iterative optimization. For photogrammetry, this opens new possibilities for real-time, scalable, and accessible 3D reconstruction. In this context, not only high reconstruction accuracy but also high-quality uncertainty estimates are crucial, as they foster trust and enable robust quality assurance. This paper therefore investigates the quality of VGGT's uncertainty predictions. The analysis identifies an effective confidence threshold for filtering VGGT's raw output and demonstrates that enhancing uncertainty quality holds strong potential for improving the accuracy of its 3D reconstructions.

2606.16478 2026-06-16 cs.AI 新提交

Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

Tensor-Coord:用于无冲突多智能体LLM规划的联合计划张量代数分解

Mudit Rastogi

发表机构 * University of Michigan(密歇根大学)

AI总结 提出Tensor-Coord框架,将多智能体联合计划表示为三阶张量,通过CP和Tucker分解识别协调结构,计算协调复杂度并定位冲突,实现无冲突规划。

详情
AI中文摘要

大型语言模型(LLM)在多智能体规划中仍然受限,因为独立生成的计划可能导致协调失败,如空间碰撞、资源争用和时间死锁。我们引入Tensor-Coord,一个多线性代数框架,将N个智能体的联合计划表示为三阶张量 \(T \in \mathbb{R}^{N \times H \times A}\),维度为智能体、时间步和动作。使用典型多面体(CP)和Tucker分解来识别潜在协调结构。最小ε近似CP秩 \(R^*\) 定义了一个可计算的协调复杂度度量,\(CC(\Pi) = (R^* - N)/N\)。我们证明 \(R^* = N\) 是计划独立性的充分必要条件。残差 \(E = T - T_{R^*}\) 定义了智能体对、时间步和动作上的冲突分数,无需领域特定规则即可定位失败。Tucker因子提供可解释的智能体角色、时间阶段和动作聚类,这些被转换为自然语言约束,用于迭代LLM重规划。在多机器人配送任务上的实验,包括简单(2个智能体,5x5网格)、中等(3个智能体,5x5网格)和困难(4个智能体,5x5网格)设置,显示在2个智能体情况下100%收敛到无冲突计划,平均迭代1.4次;3个智能体情况下80%收敛,平均迭代3.2次;4个智能体情况下60%收敛,平均迭代4.0次。CP秩近似线性增长,\(R^*(N) = 3.9N + 0.5\),支持其作为协调复杂度预测器的使用。

英文摘要

Large language models (LLMs) remain limited in multi-agent planning because independently generated plans can create coordination failures such as spatial collisions, resource contention, and temporal deadlocks. We introduce Tensor-Coord, a multilinear algebra framework that represents the joint plan of N agents as a third-order tensor \(T \in \mathbb{R}^{N \times H \times A}\) over agents, timesteps, and actions. Canonical Polyadic (CP) and Tucker decompositions are used to identify latent coordination structure. The minimal epsilon-approximate CP rank \(R^*\) defines a computable coordination complexity measure, with \(CC(\Pi) = (R^* - N)/N\). We prove that \(R^* = N\) is necessary and sufficient for plan independence. The residual \(E = T - T_{R^*}\) defines a conflict score over agent pairs, timesteps, and actions, localizing failures without domain-specific rules. Tucker factors provide interpretable agent roles, temporal phases, and action clusters that are converted into natural language constraints for iterative LLM replanning. Experiments on multi-robot delivery tasks across Easy (2 agents, 5x5 grid), Medium (3 agents, 5x5 grid), and Hard (4 agents, 5x5 grid) settings show convergence to conflict-free plans in 100% of 2-agent cases within 1.4 iterations on average, 80% of 3-agent cases within 3.2 iterations, and 60% of 4-agent cases within 4.0 iterations. CP rank scaled approximately linearly as \(R^*(N) = 3.9N + 0.5\), supporting its use as a predictor of coordination complexity.
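The coordination complexity measure quoted in the abstract, (R* - N)/N, is simple arithmetic once a CP rank is known. The sketch below only evaluates that ratio together with the reported linear trend R*(N) = 3.9N + 0.5; it does not compute any tensor decomposition, and the function names are hypothetical.

```python
def coordination_complexity(cp_rank: float, num_agents: int) -> float:
    """CC = (R* - N) / N; zero when the rank equals the number of agents."""
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    return (cp_rank - num_agents) / num_agents


def fitted_cp_rank(num_agents: int) -> float:
    """Linear trend reported in the abstract: R*(N) = 3.9 N + 0.5."""
    return 3.9 * num_agents + 0.5


# Evaluate the reported trend for the three experimental settings.
trend = {
    n: coordination_complexity(fitted_cp_rank(n), n)
    for n in (2, 3, 4)
}
```

Under the fitted trend the measure stays close to 3 for two to four agents (roughly 3.15, 3.07, and 3.03), while an independent plan with R* = N gives exactly 0.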

2606.16477 2026-06-16 cs.CV 新提交

AURA: Active-Response Attribution under Treatment Ambiguity in Bacterial Cytological Profiling

AURA:细菌细胞学分析中治疗模糊性下的主动响应归因

Kartik Jhawar, Mrunmayee Deshpande, Wilfried Moreira, Guillermo C. Bazan, Lipo Wang

发表机构 * Nanyang Technological University(南洋理工大学) Institute of High Performance Computing, A*STAR(新加坡科技研究局高性能计算研究所) University of California, Santa Barbara(加州大学圣塔芭芭拉分校)

AI总结 针对抗生素组合中仅部分药物实际作用的问题,提出基于能量的约束逆归因方法AURA,通过分解残余形态并选择重构能量最低的子集,在跨重复实验中达到95.47%的精确匹配准确率。

详情
AI中文摘要

当细菌样本暴露于多种抗生素时,并非每种施加的药物都必然起作用:如果细菌对其中一种药物耐药,则该药物不会留下形态学痕迹。因此,临床上有意义的量不是施加了哪些抗生素,而是哪些抗生素是活跃的。我们表明,在实际的大肠杆菌显微镜中,这两者严重脱钩——天真地假设施加的组合等于活跃组合的正确率仅约37%——然而现有的计算工具不适合恢复活跃集。前向扰动模型如scGen、CPA和IMPA旨在从处理预测外观,而非反向,并且反转它们会严重退化;判别式图像分类器倾向于记忆菌株和批次特定的纹理,并且无法跨实验重复迁移。我们引入AURA,它将任务重新定义为基于能量的约束逆归因。其核心归纳偏置是活跃集必须是施加集的子集;这压缩了候选空间,并让AURA通过将残余形态分解为抗生素响应原子并选择重构能量最低的子集来推断施加抗生素中的活跃子集,测试时不使用菌株标签。AURA-E添加了证据感知的弃权,当候选解释仍然近乎同等合理时保留预测。在大肠杆菌细胞学分析数据集的跨重复迁移中,AURA以95.47%的精确匹配准确率恢复活跃抗生素组合。

英文摘要

When a bacterial sample is exposed to several antibiotics, not every applied drug necessarily acts: if the organism is resistant to one of them, that drug leaves no morphological trace. The clinically meaningful quantity is therefore not which antibiotics were applied, but which ones were active. We show that these two are sharply decoupled in real E. coli microscopy - naively assuming the applied combination equals the active one is correct only about 37% of the time - yet existing computational tools are ill-suited to recovering the active set. Forward perturbation models such as scGen, CPA, and IMPA are designed to predict appearance from treatment, not the reverse, and inverting them degrades sharply; discriminative image classifiers tend to memorise strain- and batch-specific texture and fail to transfer across experimental replicates. We introduce AURA, which reframes the task as constrained, energy-based inverse attribution. Its central inductive bias is that the active set must be a subset of the applied set; this collapses the candidate space and lets AURA infer the active subset of applied antibiotics by decomposing residual morphology into antibiotic response atoms and selecting the subset with the lowest reconstruction energy, using no strain label at test time. AURA-E adds evidence-aware abstention, withholding a prediction when candidate explanations remain near-equally plausible. On cross-replicate transfer in an E. coli cytological profiling dataset, AURA recovers the active antibiotic combination with 95.47% exact-match accuracy.
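The subset constraint described above (the active set must be a subset of the applied set) makes exhaustive search over candidates feasible. The following sketch shows that search under strong simplifying assumptions that are not taken from the paper: each antibiotic has a fixed response vector ("atom"), atoms combine additively, and the energy is a plain squared reconstruction error. All names and numbers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def reconstruction_energy(
    residual: List[float], atoms: Dict[str, List[float]], subset: Tuple[str, ...]
) -> float:
    """Squared error between the residual and the sum of the chosen atoms."""
    recon = [0.0] * len(residual)
    for drug in subset:
        recon = [r + a for r, a in zip(recon, atoms[drug])]
    return sum((x - y) ** 2 for x, y in zip(residual, recon))


def infer_active_subset(
    residual: List[float], atoms: Dict[str, List[float]], applied: List[str]
) -> Tuple[Set[str], float]:
    """Search only subsets of the applied drugs and keep the lowest energy."""
    best_subset: Tuple[str, ...] = ()
    best_energy = reconstruction_energy(residual, atoms, best_subset)
    for size in range(1, len(applied) + 1):
        for subset in combinations(applied, size):
            energy = reconstruction_energy(residual, atoms, subset)
            if energy < best_energy:
                best_subset, best_energy = subset, energy
    return set(best_subset), best_energy


# Toy response atoms for three hypothetical drugs.
atoms = {"amp": [1.0, 0.0, 0.0], "cip": [0.0, 1.0, 0.0], "tet": [0.0, 0.0, 1.0]}
active, energy = infer_active_subset([0.9, 0.1, 0.0], atoms, ["amp", "cip"])
```

Here both drugs were applied, but the residual looks like the "amp" atom alone, so the search returns `{"amp"}`. An abstention rule in the spirit of AURA-E could compare the two lowest energies and withhold a prediction when they are close.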

2606.16474 2026-06-16 cs.CV cs.RO 新提交

MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

MVOFormer:用于鲁棒单目视觉里程计的流-语义Transformer

Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

发表机构 * State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University(浙江大学流体动力与机电系统国家重点实验室) Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems(浙江省工业大数据与机器人智能系统重点实验室) School of Mechanical Engineering, Zhejiang University(浙江大学机械工程学院) Robotics Institute, Zhejiang University(浙江大学机器人研究院) School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Rural Health Research Institute, Charles Sturt University(查尔斯特大学农村健康研究所) University College London(伦敦大学学院)

AI总结 提出MVOFormer,一种流-语义双分支编码器与迭代多模态解码器结合的Transformer框架,通过融合密集几何运动与语义先验实现粗到细位姿优化,在零样本泛化上显著超越现有方法。

Comments 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

单目视觉里程计(MVO)是自主导航和机器人定位的基础。然而,现有的基于学习的MVO方法通常缺乏可解释的互补特征或具有过于复杂的多阶段架构,这些局限性固有地限制了它们的鲁棒性和跨域泛化能力。在这项工作中,我们提出了MVOFormer,一种用于鲁棒单目视觉里程计的新型Transformer框架。我们的架构采用流-语义双分支编码器,将密集几何运动线索与以物体为中心的语义先验协同结合,明确区分静态结构与动态干扰物。然后,这些表示通过迭代多模态解码器融合,实现从粗到细的位姿优化,同时动态抑制对不可靠区域的注意力。大量评估表明,无需任何目标域微调,MVOFormer在TartanAir、KITTI、TUM-RGBD和ETH3D-SLAM等多个基准上实现了优越的零样本泛化和鲁棒性,显著优于先前基于学习的帧到帧方法。

英文摘要

Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.

2606.16472 2026-06-16 cs.CL 新提交

From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

从意识到遵循:通过上下文感知解码弥合口语对话系统中的上下文鸿沟

Che Hyun Lee, Heeseung Kim, Sungroh Yoon

发表机构 * ECE, Seoul National University(首尔大学电气与计算机工程系) IPAI, Seoul National University(首尔大学IPAI研究所) Department of AI, University of Seoul(首尔市立大学人工智能系)

AI总结 提出音频适配的上下文感知解码方法,通过对比有无关键上下文的输出分布,放大多模态上下文信号,解决口语对话系统中上下文遵循问题。

Comments Interspeech 2026 Main Track

详情
AI中文摘要

尽管端到端口语对话系统取得了成功,但在多轮对话中保持严格的上下文遵循仍然是一个挑战。先前的工作将这些失败归因于模型忘记对话历史,但我们强调了一个同样关键但被忽视的瓶颈:潜在上下文意识与主动遵循之间的鸿沟。尽管模型内部能识别相关的历史话语,但强大的参数先验在解码时常会掩盖这些信号。为弥合这一鸿沟,我们提出了一种音频适配的上下文感知解码方法。通过利用内部注意力机制隔离关键历史轮次,我们的方法在推理时对比有无此关键上下文的输出分布,直接放大多模态上下文信号。在Audio MultiChallenge基准上的评估表明,在语义记忆和自我一致性子任务上取得了显著改进,成功实现了严格、忠实于上下文的遵循。

英文摘要

Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
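The contrast between output distributions with and without the key context can be written in a few lines. The sketch below uses a commonly used text-only form of context-aware decoding, in which the context-conditioned logits are extrapolated away from the context-free ones. Whether the audio-adapted method uses exactly this form, and how it selects key rounds through attention, is not stated in the abstract. The weight `alpha` and the function names are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(
    with_context: List[float], without_context: List[float], alpha: float
) -> List[float]:
    """(1 + alpha) * logits_with - alpha * logits_without, per token."""
    return [
        (1.0 + alpha) * a - alpha * b
        for a, b in zip(with_context, without_context)
    ]


# Token 1 gains support only when the key context is present.
with_ctx = [2.0, 1.0]
without_ctx = [2.0, 0.0]
plain = softmax(contrastive_logits(with_ctx, without_ctx, alpha=0.0))
boosted = softmax(contrastive_logits(with_ctx, without_ctx, alpha=2.0))
```

With `alpha=0.0` the parametric prior still favours token 0; with `alpha=2.0` the token whose score rises in the presence of context becomes the most likely one.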

2606.16470 2026-06-16 cs.CV cs.RO New submission

Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands


Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

Affiliations: School of Information Science, Japan Advanced Institute of Science and Technology; University of Engineering and Technology, Vietnam National University; Department of Robotics, Hanyang University

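AI summary: Proposes a framework that decouples action recognition from object selection: TSM classifies the action, an object selection algorithm identifies the task-relevant objects, and a VLM is used to generate precise commands, yielding substantially better performance on Something-Something V2.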


Abstract

Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.
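
The abstract names Temporal Shift Modules for action classification without describing them. As a rough illustration of the general temporal shift idea (not the authors' implementation), the sketch below shifts one slice of the channels to bring in features from the next frame, shifts a second slice to bring in features from the previous frame, and leaves the rest untouched. The function name, the `fold_div` parameter, and the frames-as-lists layout with spatial dimensions dropped are assumptions made for brevity.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis.

    frames:   list of T frames, each a list of C channel values
              (spatial dimensions are omitted in this sketch)
    fold_div: hypothetical divisor; C // fold_div channels are shifted in
              each temporal direction

    Returns a new list of T frames. Positions with no source frame are
    filled with zeros.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first slice: take the value from the next frame
            elif c < 2 * fold:
                src = t - 1  # second slice: take the value from the previous frame
            else:
                src = t      # remaining channels stay in place
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


if __name__ == "__main__":
    # Three frames with eight channels each; value = 10 * frame + channel.
    clip = [[10 * t + c for c in range(8)] for t in range(3)]
    for row in temporal_shift(clip):
        print(row)
```

Because the shift only moves existing values, it lets a per-frame 2D network mix information across neighbouring frames without extra parameters, which fits the abstract's description of the module as efficient.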