arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.15568 2026-06-16 cs.RO New submission

SAPS: Shared Autonomy for Policy Steering by Blending Teleoperation with a Pretrained VLA

Crystal Zhou, Jehan Yang, Douglas J. Weber, Zackory Erickson

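Institutions: Carnegie Mellon University

AI summary: Proposes SAPS, a framework that blends human teleoperation commands with pretrained policy actions at the action level, with no retraining or auxiliary models. Its arbitration strategies, including a dynamic cosine-similarity one, raise task success rates by up to 82% over autonomous execution and reduce human intervention.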

Comments 23 pages, 15 figures, 5 tables

Abstract

Recent advancements in Vision-Language-Action (VLA) models have demonstrated impressive generalist capabilities in robot manipulation, yet these policies can be brittle under out-of-distribution spatial and semantic perturbations. While human teleoperation offers reliable recovery, it can demand high cognitive load and precise manual control, and existing policy steering methods often require auxiliary models or sampler modifications. In this work, we introduce Shared Autonomy for Policy Steering (SAPS), a framework that blends real-time human teleoperation commands with pretrained policy actions at the action level. SAPS requires no policy retraining, auxiliary dynamics models, or architectural modifications. We propose and evaluate three arbitration strategies to balance human and VLA policy control, including a dynamic Cosine-similarity arbitration strategy that computes the geometric agreement between human and policy actions. Across evaluations in simulation (LIBERO, LIBERO-PRO, CALVIN) and on real-world robot hardware, SAPS improves task success rates over autonomous execution by up to 82% in both simulation and the real world. Furthermore, our approach drastically reduces human intervention compared to pure teleoperation, while simultaneously achieving faster task completion times than both autonomous execution and pure teleoperation. These results demonstrate that action-level shared autonomy is a practical, model-agnostic approach for reliably deploying generalist robot policies in real-world contexts involving a human operator, with promising applications in assistive teleoperation and scalable data collection.
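To make the action-level blending concrete, here is a minimal Python sketch of one possible cosine-similarity arbitration rule. The abstract does not give the formula, so the mapping from agreement to blend weight, the idle-operator fallback, and the names `cosine_similarity` and `blend_actions` are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two action vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def blend_actions(human, policy):
    """Blend a human command with a policy action (hypothetical rule)."""
    # Assumption: an idle operator (all-zero command) leaves the policy in control.
    if all(h == 0.0 for h in human):
        return list(policy)
    agreement = cosine_similarity(human, policy)
    # Assumption: map agreement in [-1, 1] linearly onto a policy weight in [0, 1].
    policy_weight = 0.5 * (1.0 + agreement)
    return [policy_weight * p + (1.0 - policy_weight) * h
            for h, p in zip(human, policy)]
```

Under this rule, a policy action that points the same way as the human command is executed as is, an opposing one is fully overridden by the human, and an orthogonal one is averaged with the human command.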

2606.15566 2026-06-16 cs.CL cs.AI New submission

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

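Institutions: University of New Hampshire; Independent Researcher

AI summary: Combines a theory-driven codebook, expert annotation, and diagnostic-gated prompt optimization so that three frontier LLMs can detect realist versus instrumentalist stances toward Bayesian models in scientific text, reaching a combined reliability of 0.78 on 6,858 quotes from 210 articles.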

Comments 9 pages, 4 figures; Code and data: https://github.com/EyupEK/autoresearch_bayes

Abstract

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
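The combined reliability scores quoted above can be checked with a few lines of Python. This assumes that "combined" denotes the harmonic mean of ICC and α in both the held-out and the deployment settings; the abstract states this explicitly only for the held-out score. The function name is illustrative.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive reliability coefficients."""
    return 2 * a * b / (a + b)


# Held-out evaluation: ICC = 0.79, alpha = 0.74
held_out = harmonic_mean(0.79, 0.74)

# Deployment on 6,858 quotes: ICC = 0.80, alpha = 0.76
deployed = harmonic_mean(0.80, 0.76)

print(round(held_out, 2), round(deployed, 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high combined score requires both coefficients to be high.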

2606.15563 2026-06-16 cs.AI cs.IT cs.MA math.IT New submission

Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems

Carlos R. B. Azevedo

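Institutions: Independent Researcher

AI summary: Proposes the Minimum Sufficient Oversight Principle (MSO), which minimizes governance burden on the Fisher information manifold via a variational principle, derives a water-filling allocation over the task space, and establishes a capacity theorem, a local first-order approximation, and a drift-dominated autonomy-time scaling law, giving a computable governance framework for delegated AI systems.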

Comments Companion Python package: pip install minimal-oversight | Code: https://github.com/crbazevedo/delegation-lab | 26 pages, 1 figure, 5 tables

Abstract

AI systems increasingly delegate decisions to specialized models, evaluators, tools, and supervisory controllers. The central AI problem is no longer only model accuracy, but uncertainty-aware governance: how much autonomy to grant, which evidence should calibrate trust, what performance ceiling a delegated AI system can sustain, and when human intervention becomes necessary. We propose the Minimum Sufficient Oversight Principle (MSO), a variational principle for principled autonomy delegation: minimize governance burden on the Fisher information manifold subject to a delivery constraint. The resulting Euler-Lagrange solution yields a water-filling allocation of governed delegation across the task space. Building on a revealed-action governed delegation channel model, we prove a capacity theorem for stationary symbolwise review policies, derive a local first-order approximation relating workflow complexity to quality degradation, and give a drift-dominated autonomy-time scaling law linking intervention timing to effective capacity, complexity, and drift. Within this framework, masking appears as a structural AI-governance pathology: corrected performance can hide the competence signal needed to calibrate trust. Synthetic simulations and a semi-real reconstructed workflow support design prescriptions including upstream-first correction, sensitivity-based intervention, and explicit feasibility checks before autonomy is expanded. The result is a computable framework for uncertainty, planning, and oversight in delegated AI systems. A companion Python package is available at https://github.com/crbazevedo/delegation-lab.
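For readers unfamiliar with the term, the sketch below shows the textbook water-filling rule: a fixed budget is poured over tasks so that each task receives max(0, level - floor), with the common level chosen to exhaust the budget. The abstract does not give the paper's functional form, so the per-task `floors`, the `budget`, and the bisection solver are assumptions for illustration and do not use the companion package's API.

```python
def water_fill(floors, budget, iters=100):
    """Classic water-filling: allocation_i = max(0, level - floors[i]).

    The water level is found by bisection so that allocations sum to `budget`.
    """
    lo, hi = min(floors), max(floors) + budget
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        used = sum(max(0.0, level - f) for f in floors)
        if used > budget:
            hi = level  # level too high: budget exceeded
        else:
            lo = level  # level feasible: try a higher one
    level = 0.5 * (lo + hi)
    return [max(0.0, level - f) for f in floors]


# Hypothetical floors: the third task sits above the water level and gets nothing.
alloc = water_fill([1.0, 2.0, 5.0], 3.0)
```

The characteristic feature is the threshold: tasks whose floor lies above the water level receive no allocation at all, while the rest are topped up to a common level.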

2606.15553 2026-06-16 cs.LG cs.AI New submission

Distilling Drifting Transformers with Representation Autoencoders

Jiawei Zhang, Mengfei Xia, Gen Li, Yuantao Gu

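Institutions: Tsinghua University; Ant Group; CUHK

AI summary: Proposes Drift-RAE, which uses the drifting paradigm to distill pretrained flow models in the latent space of representation autoencoders, addressing anisotropy and large curvature, and reaches 1.77 FID on ImageNet 256 with only 10k distillation steps.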

Abstract

Representation Autoencoders (RAEs) have improved diffusion and flow models by providing a semantically richer latent space, owing to the strongly label-wise clustered DINO features in the pretrained encoders. Yet in the distillation stage, the severe anisotropy and large curvatures caused by the rich semantic representations hinder convergence and performance, making trajectory-based distillation unstable. In this work, we argue that the RAE latent space is compatible with distillation via the newly proposed Drifting Models. We first quantitatively study the curvature and isotropy statistics across different autoencoders, and theoretically reveal that the Drifting Model itself is highly likely to fail on extremely scattered spaces like those of reconstruction-based VAEs. These findings motivate us to apply the drifting paradigm directly to representation autoencoders. Our proposed method, Drift-RAE, distills pretrained flow models in RAE latent spaces using Drifting, together with insightful modifications that improve training stability by theoretically aligning drifting fields with other frameworks. Experimentally, we achieve 1.77 FID on the ImageNet 256 dataset using only 10k distillation steps, surpassing state-of-the-art RAE distillation methods and remaining competitive with the original Drifting Model without requiring an auxiliary MAE feature extractor. The code will be made publicly available.

2606.15551 2026-06-16 cs.LG New submission

A Bifurcation Theory Framework for Gradient Descent on the Edge of Stability

Eric Gan

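Institutions: Eric Gan

AI summary: Proposes a bifurcation theory framework that decomposes training dynamics into components normal and tangent to the manifold of minimizers, shows that edge-of-stability training arises from a flip bifurcation in the normal direction, and proves convergence to the minimizing manifold.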

Abstract

The Edge of Stability (EoS) phenomenon, where gradient descent operates with sharpness exceeding the classical convergence threshold yet the loss decreases over long timescales, is ubiquitous in modern deep learning but remains poorly understood in realistic settings. Prior rigorous analyses have been largely confined to scalar or low-dimensional losses with specific structural forms. In this work, we develop a bifurcation theory framework for gradient descent on the edge of stability that applies directly to overparameterized neural networks. By decomposing the training dynamics into components normal and tangent to the manifold of minimizers, we show that stable EoS training arises from a flip bifurcation in the normal direction, governed by the sign of the first Lyapunov coefficient, while the tangent dynamics drift toward regions of decreasing sharpness. Under mild spectral and geometric assumptions on the loss landscape, we prove convergence to the minimizing manifold when training at the EoS threshold. As a corollary, we recover and unify prior results: we show that the product-stability condition of Gan (2026) is an instance of our framework.
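The classical convergence threshold mentioned in the abstract can be illustrated in one dimension. Gradient descent with step size lr on the quadratic f(x) = sharpness * x^2 / 2 multiplies x by (1 - lr * sharpness) at every step, so it contracts when sharpness < 2 / lr, oscillates with constant amplitude at sharpness = 2 / lr (multiplier exactly -1, the setting in which a flip bifurcation can occur), and diverges above it. This toy model is an assumption chosen for illustration; it does not reproduce the paper's normal/tangent decomposition or its Lyapunov-coefficient condition.

```python
def gd_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f at x is sharpness * x
    return x


lr = 0.1                           # classical threshold is 2 / lr = 20
below = gd_quadratic(19.0, lr)     # multiplier -0.9: sign-flipping decay
at_edge = gd_quadratic(20.0, lr)   # multiplier -1.0: period-2 oscillation
above = gd_quadratic(21.0, lr)     # multiplier -1.1: sign-flipping growth
```

On a pure quadratic nothing stabilizes the iterates above the threshold; the higher-order terms of a real loss landscape are what make sustained training near the edge possible.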

2606.15550 2026-06-16 cs.RO New submission

Robots as Tokens: Unified Diffusion Transformer for Coordinated Multi-Robot Trajectory Generation

Ruofei Bai, Jie Chen, Yuxin Cai, Jun Li, Wei-Yun Yau, Lihua Xie

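Institutions: Nanyang Technological University; Agency for Science, Technology and Research; National University of Singapore

AI summary: Proposes Roken, which represents each robot as a discrete token and uses a diffusion transformer to directly generate multi-robot trajectories that satisfy safety and connectivity constraints, without iterative post-processing.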

Comments 23 pages, 13 figures; Project page: https://bairuofei.github.io/roken-project-page/

Abstract

The success of generative models in language and visual generation has inspired extensive applications to generative robot planning. However, most existing works either focus on single-robot planning, or generate multi-robot trajectories in a sequential manner with iterative post-processing to resolve inter-robot conflicts. In this work, we investigate whether coordinated multi-robot trajectories, as a special spatiotemporal distribution, can be learned and generated with a generative model in a feed-forward manner. We propose Robots as Tokens (Roken), a unified diffusion transformer that directly generates multi-robot trajectories that satisfy both (individual) safety and (global) connectivity constraints. The core design of Roken is to represent each robot as a discrete token, allowing them to naturally interact with each other through self-attention, and cross-attend to map tokens for environment layouts. We further introduce several auxiliary tasks based on Bayes' theorem to provide multi-scale spatial-temporal supervision for efficient learning of the conditional distribution. In training, Roken absorbs diverse expert trajectories from different team sizes. During inference, Roken behaves as a versatile multi-robot planner that can handle single-robot planning, coordinated multi-robot trajectory generation, and conditional trajectory generation by fixing some robot tokens as conditions. Experiments in diverse cluttered environments show that Roken can generate coordinated multi-robot trajectories to perform connectivity-constrained goal navigation tasks with high success rates, outperforming the baseline method used to generate the training dataset. Roken also demonstrates good scalability after training with mixed team sizes, and shows generalization to unseen or partially observed environments, verifying its potential to learn from diverse data and perform versatile tasks.

2606.15547 2026-06-16 cs.CV cs.AI New submission

EcoBin: A Two-Stage Deep Convolutional Neural Network for Contamination-Aware Waste Classification

Raghav Senthil Kumar

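Institutions: BASIS Phoenix

AI summary: Proposes EcoBin, a two-stage deep CNN that uses a synthetic contamination dataset and a contamination-detection stage to substantially improve the routing of contaminated recyclables.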

Comments 7 pages, 8 figures

Abstract

Waste classification models have become highly accurate at sorting waste, often exceeding 95% on benchmark datasets. However, these models fail to account for contamination in recyclable waste. We present EcoBin, a two-stage deep convolutional neural network that classifies household waste by its disposal pathway and that explicitly accounts for contamination. The first stage is a base waste classifier built on an EfficientNetV2-S backbone that assigns each of the thirty waste categories in our dataset to one of four disposal pathways. The second stage is a contamination classifier that inspects any item routed toward recycling and overrides the decision to garbage when contamination is detected. Because no public dataset of contaminated recyclables exists, we synthesize one by segmenting images of clean recyclable objects with a U2-Net model and compositing realistic contamination textures onto their surfaces. The first stage achieves 87.42% test accuracy and a 96.13% pathway-adjusted accuracy. Meanwhile, the contamination stage distinguishes clean from contaminated items with a 0.99 ROC-AUC. On a test set of contaminated recyclables, the complete pipeline routes 24 of 25 items correctly, compared with only 1 of 25 for the base classifier alone. A McNemar's test confirms that the improvement contributed by the contamination stage is statistically significant (p < 0.001).
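The two-stage decision logic described above can be sketched as a small routing function. Only the "recycling" and "garbage" pathways are named in the abstract; the 0.5 threshold, the "other" placeholder pathway, and the function signature are assumptions for illustration, and the two classifiers themselves are represented only by their outputs.

```python
def route(base_pathway, contamination_prob, threshold=0.5):
    """Combine stage-one and stage-two outputs into a final disposal pathway.

    base_pathway: pathway predicted by the base classifier (stage one).
    contamination_prob: contamination score from stage two (hypothetical scale 0..1).
    """
    # Stage two only inspects items that stage one routed toward recycling.
    if base_pathway == "recycling" and contamination_prob >= threshold:
        return "garbage"  # override: contaminated recyclables go to garbage
    return base_pathway


print(route("recycling", 0.9))  # contaminated recyclable
print(route("recycling", 0.1))  # clean recyclable
print(route("other", 0.9))      # non-recycling pathways are never overridden
```

Keeping the override one-directional means the second stage can only send recyclables to garbage, so it cannot introduce new contamination into the recycling stream.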

2606.15540 2026-06-16 cs.SD cs.AI cs.MM eess.AS New submission

AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Optimization for Pathological Speech Reconstruction

Pengfei Zhang, Hoang H Nguyen, Yutong Song, Wenjun Huang, Tahmid Imtiaz Imu, Henry Peng Zou, Jiang Wu, Honghui Xu, Amir M. Rahmani

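Institutions: University of California Irvine; University of Illinois Chicago; Kennesaw State University

AI summary: For pathological speech from patients with neurodegenerative and neuromotor disorders, proposes AP-GRPO, which optimizes speech language models with an anchor-gated reward and a phonetic alignment reward to achieve faithful reconstruction and reveal disease-specific profiles.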

Abstract

Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.

2606.15534 2026-06-16 cs.CV New submission

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs

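Institutions: Washington University in St. Louis; University of Copenhagen

AI summary: Proposes Track2View, which gives a video diffusion transformer explicit spatiotemporal correspondences through paired 3D point tracks to re-render videos from novel viewpoints, achieving state-of-the-art visual quality, view synchronization, and camera accuracy.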

Abstract

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page: https://qjizhi.github.io/track2view

2606.15532 2026-06-16 cs.CL cs.LG New submission

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

Rongzhi Zhu, Xiang Huang, Yuchuan Wu, Rui Wang, Zequn Sun, Tao Ren, Weiyao Luo, Bingxue Qiu, Jieping Ye, Yongbin Li, Wei Hu

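Institutions: State Key Laboratory for Novel Software Technology, Nanjing University; Qwen-Character Team, Alibaba Group

AI summary: Proposes EIBench, a simulator-based benchmark of 2,222 scenarios that evaluates multi-turn emotion management through a 2x2 taxonomy (Support, Defense, Repair, Charm), and CTC-GRPO, which uses per-turn state updates as dense feedback to improve the emotional intelligence of models.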

Abstract

Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only recognize a user's emotion, but also improve the user's emotional and relational state over several turns. We introduce EIBench, a simulator-based benchmark for interactive emotion management. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held-out testing. The scenarios are organized by a 2x2 taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building. In each scenario, an LLM simulator plays the user, updates an emotion-relation state after each turn, and maps the final state to an anchor-based score. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per-turn state updates provide dense feedback for RL. We evaluate 15 open- and closed-source LLMs. Current models perform well on support and rapport-building scenes, but struggle with boundary maintenance under user pressure. To improve the EI ability of LLMs, we propose Centered Turn-Credit GRPO (CTC-GRPO), a GRPO extension that reuses the simulator's per-turn state updates as dense turn-level feedback while preserving the final outcome reward. CTC-GRPO improves Qwen3-8B from -22.4 to +22.4 on EIBench and also improves on out-of-distribution evaluations including SAGE (+12.4) and EQBench3 (+20.9%). Our results show that simulator-tracked user states can support both evaluation and training for multi-turn emotion management.

2606.15527 2026-06-16 cs.CV cs.AI New submission

Selective Synergistic Learning for Video Object-Centric Learning

WonJun Moon, Jae-Pil Heo

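Institutions: KAIST; Sungkyunkwan University

AI summary: Proposes Selective Synergistic Learning (SSync), which selectively distills reliable cues through linear-complexity pseudo-labeling to avoid error propagation, improving video object decomposition quality and serving as a plug-and-play module.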

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling scheme with linear complexity, eliminating the need for quadratic spatial comparisons. In addition, to prevent the reinforcement of architectural biases like slot redundancy, we introduce transitive pseudo-label merging, which consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.

2606.15521 2026-06-16 cs.CL cs.LG New submission

Emergent retokenization symmetry in large language models: phenomenology and applications

Kanishk Jain, Matthew Day, Tankut Can

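Institutions: Department of Physics, Emory University

AI summary: Finds that a retokenization symmetry partially emerges in large language models during training, uses retokenization experiments to probe sensitivity and robustness to semantically equivalent input representations, and proposes a new inference-time sampling strategy.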

Abstract

Tokenization introduces representational redundancy: under a fixed token vocabulary, every byte string admits many valid token encodings, or segmentations, that decode to the same surface string. However, given a prompt, most language model tokenizers break this representational symmetry by returning a canonical segmentation. Training only on canonical segmentations should influence inference behavior, and there is little reason to expect models to respect segmentation symmetry on downstream tasks. We find that this symmetry partially emerges during training. Here, we probe this emergent symmetry through experiments testing token compositional understanding, representation diversity, and task-focused benchmark performance. We primarily use retokenization, that is, replacing a prompt's canonical tokenization with an alternative segmentation while preserving its bytes exactly. Relative to other prompt perturbations, retokenization is unusually clean because it isolates segmentation effects without changing syntax, semantics or surface form. We use retokenization to study sensitivity and robustness to semantically identical input representations across pretraining and post-training. Moreover, this partial retokenization symmetry suggests a distinct inference-time sampling axis. While temperature sampling generates diverse outputs from the model using its next-token probability distribution, retokenization generates diversity from the model's internal computations through semantically equivalent input representations. We find that while this retokenization sampling strategy can hurt performance on easy problems, it can also recover solutions that conventional sampling does not find. Overall, our work presents retokenization as a simple yet powerful probe of large language models, shedding light on compositional understanding and prompt sensitivity, and offering a novel sampling strategy.
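The redundancy described above is easy to see on a toy example. The sketch below enumerates every segmentation of a string under a fixed vocabulary; each one decodes to the same surface string, and a tokenizer would return just one of them as canonical. The character-level toy vocabulary and the function name are assumptions for illustration; real tokenizers operate on bytes with much larger vocabularies, and this is not the authors' code.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into consecutive tokens drawn from `vocab`."""
    if text == "":
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            # Keep this prefix as one token and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary.
vocab = {"c", "a", "t", "s", "ca", "at", "ts", "cat", "cats"}
options = segmentations("cats", vocab)
for tokens in options:
    print(tokens)
```

With this vocabulary the single-token segmentation ["cats"] is one of seven alternatives; retokenization in the sense above means feeding the model one of the other six.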

2606.15517 2026-06-16 cs.CL New submission

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

Viswonathan Manoranjan, Amogh Gupta, Anvesh Rao Vijjini, Thomas Hofweber, Snigdha Chaturvedi

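Institutions: UNC Chapel Hill

AI summary: Proposes SHARD, which rewrites sensitive prompts using philosophical guidelines to surface benign intent, has the model reframe its own responses, and fine-tunes on them, improving helpfulness while preserving safety.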

Abstract

Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce SHARD, a self-reframing distillation method to improve safe-helpfulness. It first rewrites sensitive prompts to surface benign intent using philosophical guidelines, then reframes its original responses into safe, more helpful ones, and finally fine-tunes the model on its self-reframed responses. Across DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety. It also remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior elicited from themselves. Warning: This paper contains content that may be offensive or harmful.

2606.15514 2026-06-16 cs.RO cs.LG New submission

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Hassan Ismkhan, Hamid Bouchahcia

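Institutions: Bournemouth University

AI summary: Proposes RL4IL, which uses reinforcement learning policies to retrieve the most relevant expert demonstrations from a training library and fuses them through soft cross-attention to produce actions, handling missing sensors effectively and outperforming existing methods on the LIBERO benchmark.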

详情
AI中文摘要

机器人系统通过多种输入模态感知世界——包括视觉摄像头流和自然语言指令——并必须基于这些信号选择适当的动作。然而,假设所有输入设备永久可用是不现实的,因为在部署过程中传感器可能失效、被遮挡或完全丢失。因此,鲁棒处理此类缺失模态场景对于真实世界的机器人操作至关重要。本文介绍了RL4IL,一种强化学习引导的模仿学习方法,通过从训练库中识别最相关的专家演示,为给定观测选择最合适的动作。一个强化学习策略,通过基于广度优先搜索候选集的近端策略优化进行训练,对候选演示进行排序,一个软交叉注意力融合头聚合它们的动作信号以产生最终预测。当推理时模态缺失时,一个专用的每模态RL检索策略从训练库中识别捐赠演示,一个软插补头通过交叉注意力在排名靠前的捐赠者上重建缺失嵌入——无需对系统进行任何重新训练。在三个LIBERO基准套件上的实验表明,RL4IL在传感器丢失条件下显著优于最先进的模仿学习方法,同时无需策略网络训练。代码可在https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera找到。

English abstract

Robotic systems perceive the world through multiple input modalities -- including visual camera streams and natural language instructions -- and must select appropriate actions based on these signals. However, assuming the permanent availability of all input devices is unrealistic, as sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement learning guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations from a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors -- without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera

2606.15512 2026-06-16 cs.LG physics.plasm-ph New submission

Towards Data-Efficient Cross-Device Generalization of Grad-Shafranov Equilibria via Transfer Learning Neural Operator

通过迁移学习神经算子实现Grad-Shafranov平衡的数据高效跨设备泛化

Jay Phil Yoo, William Howes, Yashika Ghai, Kazuma Kobayashi, Souvik Chakraborty, Syed Bahauddin Alam

Affiliations * Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering Department, University of Illinois Urbana-Champaign; Fusion Energy Division, Oak Ridge National Lab; National Center for Supercomputing Applications; Department of Applied Mechanics, Indian Institute of Technology Delhi; Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi

AI summary Proposes a cross-device neural operator framework that recasts equilibrium reconstruction as an operator learning problem and achieves data-efficient transfer through multi-geometry pretraining; the Wavelet Neural Operator reaches L2 error below 4% with 100 target samples.

Details
AI-generated Chinese abstract

磁流体动力学平衡的实时重建对于磁约束聚变中的等离子体成形、稳定性评估和反馈控制至关重要。然而,Grad-Shafranov平衡计算在很大程度上仍然是设备特定的和迭代的,限制了它们在延迟受限的控制环境中的应用。现有的神经方法可以加速单个平衡预测,但它们通常无法提供跨变化的等离子体边界或托卡马克几何形状的可重用模型。在这里,我们展示了平衡重建可以重新表述为跨设备算子学习问题。我们开发了一个特定领域的神经算子框架,将几何和剖面参数直接映射到极向磁通场,用摊销的算子推理取代重复的按需求解计算。使用可解析处理的Solov'ev族作为受控的Grad-Shafranov测试平台,我们在八种几何上不同的类托卡马克配置中生成平衡,并在四种迁移学习策略下对五种神经算子架构进行基准测试。单几何预训练对未见设备迁移效果差,而多几何预训练能够实现数据高效的适应。Wavelet Neural Operator在跨几何性能上最强,在100个标记目标平衡下达到低于4%的平均相对L2误差,在全微调下低于2%。预测的磁场满足无散约束至数值精度,四种架构实现毫秒或亚毫秒级推理。这些结果确定了神经算子预训练是实现跨聚变设备配置的可重用实时平衡推理的途径。

English abstract

Real-time reconstruction of magnetohydrodynamic equilibria is essential for plasma shaping, stability assessment and feedback control in magnetic confinement fusion. However, Grad-Shafranov equilibrium calculations remain largely device-specific and iterative, limiting their use in latency-constrained control settings. Existing neural approaches can accelerate individual equilibrium predictions, but they do not generally provide reusable models across changing plasma boundaries or tokamak geometries. Here we show that equilibrium reconstruction can be recast as a cross-device operator learning problem. We develop a domain-specific neural operator framework that maps geometry and profile parameters directly to the poloidal flux field, replacing repeated solve-on-demand computation with amortized operator inference. Using the analytically tractable Solov'ev family as a controlled Grad-Shafranov testbed, we generate equilibria across eight geometrically distinct tokamak-like configurations and benchmark five neural operator architectures under four transfer-learning strategies. Single-geometry pretraining gives poor transfer to unseen devices, whereas multi-geometry pretraining enables data-efficient adaptation. The Wavelet Neural Operator gives the strongest cross-geometry performance, reaching mean relative L2 errors below 4% with 100 labelled target equilibria and below 2% with full fine-tuning. The predicted magnetic fields satisfy the divergence-free constraint to numerical precision, and four architectures achieve millisecond or sub-millisecond inference. These results identify neural operator pretraining as a route towards reusable, real-time equilibrium inference across fusion device configurations.
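
The "mean relative L2 error" quoted in this abstract can be read as the following computation. This is a minimal sketch under the assumption that each flux field is flattened to a list of grid values; the function names are hypothetical and the numbers are illustrative, not data from the paper.

```python
import math


def relative_l2_error(predicted, reference):
    """Return ||predicted - reference||_2 / ||reference||_2 for one flattened field."""
    diff_norm = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    ref_norm = math.sqrt(sum(r ** 2 for r in reference))
    return diff_norm / ref_norm


def mean_relative_l2_error(predicted_fields, reference_fields):
    """Average the per-field relative L2 error over a set of equilibria."""
    errors = [relative_l2_error(p, r) for p, r in zip(predicted_fields, reference_fields)]
    return sum(errors) / len(errors)


# Illustrative values only.
reference_flux = [3.0, 4.0]
perfect = relative_l2_error([3.0, 4.0], reference_flux)   # exact prediction
all_zero = relative_l2_error([0.0, 0.0], reference_flux)  # predicting zero everywhere
```

Under this reading, an error below 4% means the prediction's deviation has less than 0.04 times the norm of the reference field, averaged over test equilibria.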

2606.15510 2026-06-16 cs.CL cs.DL New submission

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

AthDGC:一个开放的历时希腊语树库及其印欧语平行语料

Nikolaos Lavidas, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, Evangelos Argyropoulos

Affiliations * National and Kapodistrian University of Athens

AI summary Presents the first openly licensed dependency treebank spanning eight diachronic periods, under a unified PROIEL XML 2.0 schema, cross-aligned with Indo-European languages such as Latin and Gothic.

Comments 16 pages. Data paper for the v0.4 release of AthDGC. Concept DOI: 10.5281/zenodo.20439182. Companion site: https://athdgc.github.io

Details
AI-generated Chinese abstract

AthDGC(“Athens-PROIEL”)是一个开放的端到端工作流和数据集。据我们所知,它是第一个公开许可的依存句法分析树库,涵盖希腊语的八个历时时期,即古风时期、古典时期、通用希腊语时期、晚期古代、拜占庭时期、晚期拜占庭时期、早期现代和现代希腊语,采用单一的PROIEL XML 2.0模式,并将《新约》按诗句级别与拉丁语(武加大译本)、哥特语(乌尔菲拉译本)、古教会斯拉夫语(Marianus译本)和古典亚美尼亚语进行交叉对齐。AthDGC建立在PROIEL树库家族(Haug and Johndal 2008; Eckhoff et al. 2018)之上,该家族为项目建立了模式和通用希腊语参考集。标注使用Stanford Stanza PROIEL训练的工作流;句子级对齐使用LaBSE,一种多语言句子嵌入模型;词级对齐通过AwesomeAlign程序使用多语言BERT注意力。v0.4版本提供精选样本和开源工具包;完整注释语料库分区仍在希腊国家HPC上进行v0.5审计。定量规模、每个见证的诗句计数和每个时期的注释行计数将在审计通过后的v0.5发布说明中报告。概念DOI:10.5281/zenodo.20439182。

English abstract

AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: 10.5281/zenodo.20439182.

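2606.15507 2026-06-16 cs.AI New submission

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

LLaMA 3.1-8B-Instruct中的框架条件化道德计算:伦理推理的机械可解释性审计

Ali Dasdan, Manan Shah, W. Russell Neuman, Chad Coleman, Kund Meghani, Safinah Ali

Affiliations * KD Consulting, CA, USA; New York University, NY, USA

AI summary Analyzes the internal computation of LLaMA 3.1-8B-Instruct on 54 moral prompts using a mechanistic interpretability platform and finds a Situational Anchor Effect: domain-specific representations dominate the top of the activation list, and the model's ethics capacity stays constant while its salience depends strongly on the interpretive frame the prompt selects.

Comments 47 pages, 10 figures

Details
AI-generated Chinese abstract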

大型语言模型在道德提示上的行为审计测量的是模型所说的内容,而非产生这些内容的内部计算。我们使用AI驱动的机械可解释性平台Transluce,在四个电池组的54个道德提示上检查LLaMA 3.1-8B-Instruct:17个困境、政策和元伦理问题(B1);6个角色扮演场景(B3);以及一个受控的电车难题对比,其中切换机制随人员固定而变化(B4,15个提示)或身份属性随机制固定而变化(B5,16个提示)。两个互补的度量族——五个聚类级度量和六个度量神经元级面板——收敛于一个情境锚定效应:在每个电池组中,领域特定表示主导激活列表的顶部。模型的道德标记能力基本保持不变;其显著性(排名、优先级、列表顶部存在性)对提示选择的解释框架高度敏感。B4与B5的对比证实,模型关注任何变化的表面特征:聚合的道德度量无法区分,但占主导地位的非道德干扰因素反映了设计。多温度审计识别出一个候选道德神经元(L16/N3837),在不同温度下保持稳定;两个前沿模型上的跨模型行为代理提供了自我报告道德焦点差异的初步证据,与对齐包装器一致,其中RLHF重新排序表面文本而不移除底层的领域优先框架。我们将这些统一为框架条件化道德计算:提示的表面词汇选择一个特征流形,道德结论是该选择的下游结果。行为对齐必须辅以机械对齐:一个研究计划,询问在受控框架变化下,道德相关特征是否可以被证明具有因果特权,而不仅仅是在解释中响亮。

English abstract

Behavioral audits of Large Language Models on moral prompts measure what the model says, not the internal computation producing it. We use Transluce, an AI-driven mechanistic-interpretability platform, to examine LLaMA 3.1-8B-Instruct on 54 moral prompts in four batteries: 17 dilemmas, policy, and meta-ethical questions (B1); 6 role-playing scenarios (B3); and a controlled trolley contrast varying the switching mechanism with people fixed (B4, 15 prompts) or identity attributes with mechanism fixed (B5, 16 prompts). Two complementary metric families, five cluster-level metrics and a six-metric neuron-level panel, converge on a Situational Anchor Effect: domain-specific representations dominate the top of the activation list across every battery. The model's ethics-labeled capacity stays essentially constant; its salience (rank, priority, top-of-list presence) is highly sensitive to the interpretive frame the prompt selects. The B4-vs-B5 contrast confirms the model attends to whichever surface feature varies: aggregate ethics metrics are indistinguishable, but the dominant non-ethics distractor mirrors the design. A multi-temperature audit identifies a candidate ethics neuron (L16/N3837) stable across temperatures; a cross-model behavioral proxy on two frontier models yields preliminary evidence of divergence in self-reported moral focus, consistent with an Alignment Wrapper in which RLHF re-orders surface text without removing underlying domain-first frames. We unify these as Frame-Conditioned Moral Computation: the prompt's surface vocabulary selects a feature manifold, and the moral conclusion is downstream of that selection. Behavioral alignment must be supplemented by Mechanistic Alignment: a research program asking whether ethics-related features can be shown causally privileged under controlled frame variation, not merely loud in the explanation.

2606.15503 2026-06-16 cs.AI cs.CY cs.MA cs.NE New submission

Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

合成反适应:人机共同进化的一个原理

Ivar Frisch, Jackie Kay, Philip Moreira Tomei

Affiliations * Spectral Circuits Research; Independent Researcher; AI Objectives Institute

AI summary Proposes the concept of synthetic counteradaptation, describing how humans and AI co-evolve by adapting to each other's strategies and behaviors, and analyzes cases including Go, mixed-motive social interactions, and geopolitical simulations.

Comments 15 pages, 1 figure. Published in Antikythera (MIT Press), February 2025

Journal ref Antikythera Journal, MIT Press, February 2025

Details
AI-generated Chinese abstract

在本文中,我们引入了合成反适应的概念,这是一个人类与AI系统通过相互适应对方的策略和行为而共同进化的过程。当AI系统发展出新的策略或社会协议,促使人类提取见解并调整自身行为作为回应时,就会发生合成反适应,从而导致新的智能体交互动态的出现。为了说明这些动态,我们分析了来自不同背景的案例,包括围棋游戏、混合动机社交互动和地缘政治模拟。通过探索这些案例,我们展示了合成反适应如何为理解多智能体环境中人机交互的递归和共同进化性质提供一个框架。

English abstract

In this paper, we introduce the concept of synthetic counteradaptation, a process where human and AI systems co-evolve by adapting to each other's strategies and behaviors. Synthetic counteradaptation occurs when AI systems develop novel strategies or social protocols, prompting humans to extract insights and adapt their own behaviors in response, leading to the emergence of new agent interaction dynamics. To illustrate these dynamics, we analyze examples from various contexts, including the game of Go, mixed-motive social interactions, and geopolitical simulations. By exploring these cases, we demonstrate how synthetic counteradaptation provides a framework for understanding the recursive and co-evolutionary nature of human-AI interactions in multi-agent environments.

2606.15497 2026-06-16 cs.AI New submission

Towards End-to-End Automation of AI Research

迈向AI研究的端到端自动化

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Chris Lu, Shengran Hu, Jakob Foerster, David Ha, Jeff Clune

Affiliations * Sakana AI; FLAIR, University of Oxford; University of British Columbia; Vector Institute

AI summary Presents The AI Scientist, a system that uses foundation models to automate research from ideation to manuscript writing and that produced a manuscript passing the first round of peer review at a machine learning conference workshop.

Comments Published in Nature 651, 914-919 (2026)

Details
AI-generated Chinese abstract

科学自动化是AI领域的一个长期目标。虽然社区在自动化科学过程的各个组成部分方面取得了显著进展,但能够自主导航整个研究生命周期(从构思到发表)的系统仍然遥不可及。在这里,我们展示了迄今为止朝着端到端自动化整个过程的最强演示。我们提出了AI Scientist,它能够创建研究想法、编写代码、运行实验、绘制和分析数据、撰写完整的科学手稿并进行自己的同行评审。其想法、执行和呈现的质量足以生成一份由AI系统产生的手稿,该手稿通过了机器学习会议研讨会的首轮同行评审。该研讨会的接受率为70%。我们的系统在一个复杂的代理系统中利用了现代基础模型。我们在两种设置中评估AI Scientist:一种聚焦模式,使用人类提供的代码模板作为初始支架,在特定主题上进行研究;另一种是无模板的开放模式,利用代理搜索进行更广泛的科学探索。两种设置都能产生多样化的想法,并自动测试、报告和评估它们。这一成就展示了AI在科学贡献方面日益增长的能力,并标志着研究方式可能发生的范式转变。与任何有影响力的新技术一样,可能存在重大风险,包括给不堪重负的评审系统增加负担以及给科学文献带来噪音。然而,如果负责任地开发,这种自主系统可以极大地加速科学发现。

English abstract

The automation of science is a long-standing ambition in the field of AI. While the community has made significant progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle -- from conception to publication -- has remained out of reach. Here, we present the strongest demonstration to date toward automating the entire process end-to-end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyzes data, writes the entire scientific manuscript and performs its own peer review. Its ideas, execution, and presentation are of sufficient quality to produce a manuscript generated by an AI system that passes the first round of peer review at a major machine learning conference workshop. The workshop has an acceptance rate of 70 percent. Our system leverages modern foundation models within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold to conduct research on a specific topic, and a template-free, open-ended mode that leverages agentic search for wider scientific exploration. Both settings produce diverse ideas and automatically test, report on, and evaluate them. This achievement demonstrates AI's growing capacity for scientific contribution and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be significant risks, including taxing overwhelmed review systems and adding noise to scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

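2606.15494 2026-06-16 cs.RO New submission

Understanding and Modeling Perceived Cognitive and Physical Strain Dynamics for Planning-Oriented Human-Robot Collaboration in Prefabricated Construction

理解与建模感知的认知和体力应变动态:面向预制建筑中规划导向的人机协作

Yifan Wang, Bo Xiao, Shane T. Mueller

Affiliations * Department of Civil, Environmental, and Geospatial Engineering, Michigan Technological University

AI summary Through a controlled repeated work-rest experiment, this study builds empirically grounded models of perceived strain, with a linear mixed-effects model for cognitive strain accumulation and nonlinear decay for recovery, to inform the planning of human-robot collaboration in prefabricated construction.

Comments 53 pages, 15 figures

Details
AI-generated Chinese abstract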

预制建筑中的人机协作需要规划方法不仅考虑生产力,还要考虑重复工作和休息期间随时间变化的工人状态。现有的规划模型通常依赖于关于疲劳、工作量或恢复的简化假设,缺乏关于感知应变如何演变的特定领域经验证据。本研究开发了一种基于经验的、规划导向的方法,以表征预制建筑人机协作中感知应变的积累和恢复。通过受控重复工作-休息实验,使用心理努力评定量表和博格感知 exertion 评定量表评估感知的认知和体力应变。评估了线性和指数函数形式,然后进行混合效应建模以检查协作条件、会话效应和个体间变异性。结果表明,认知应变积累最好由线性混合效应模型表示,而休息阶段的恢复遵循非线性衰减。由此产生的规划导向模型可能为未来的人状态感知任务分配和调度研究提供信息。

English abstract

Human-robot collaboration (HRC) in prefabricated construction requires planning approaches that consider not only productivity but also time-dependent worker states during repeated work and rest. Existing planning models often rely on simplified assumptions about fatigue, workload, or recovery, with limited domain-specific empirical evidence on how perceived strain evolves. This study develops an empirically grounded, planning-oriented approach to characterize perceived strain accumulation and recovery in prefabricated construction HRC. A controlled repeated work-rest experiment assessed perceived cognitive and physical strain using the Rating Scale for Mental Effort and Borg's Rating of Perceived Exertion. Linear and exponential functional forms were evaluated, followed by mixed-effects modeling to examine collaborative conditions, session effects, and inter-individual variability. Results indicate that cognitive strain accumulation is best represented by a linear mixed-effects model, whereas rest-phase recovery follows nonlinear decay. The resulting planning-oriented models may inform future human-state-aware task allocation and scheduling research.

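2606.15493 2026-06-16 cs.LG cs.CR New submission

Model Stealing Through the Lens of Model Multiplicity

从模型多重性视角看模型窃取

Eliott Baltz, Satoshi Hara, Ulrich Aïvodji

Affiliations * ÉTS, Mila; The University of Electro-Communications

AI summary Computes the Rashomon set of surrogate models and evaluates its diversity, finding that high-fidelity surrogates can differ substantially from the target model on key performance metrics, which challenges conventional wisdom.

Comments 14 pages, 15 figures

Details
AI-generated Chinese abstract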

模型窃取攻击中,对手创建高保真替代模型,对机器学习服务的知识产权构成重大威胁。传统观点认为这些替代模型能为对手提供与原始服务提供商相当的经济杠杆。本文通过评估模型窃取攻击超越单纯对目标模型的保真度来挑战这一假设。由于基于查询的提取仅提供目标输入输出行为的部分监督,替代模型并非唯一确定:许多接近最优的替代模型可以在实现相当保真度的同时,在部署相关属性上存在差异。我们不执行经典的基于学习的模型窃取攻击,而是计算替代模型的Rashomon集(即几乎同等准确的模型集合),并使用多重性指标(歧义性、差异性和Rashomon容量)和群体公平性指标评估其多样性。在表格、医学影像和NLP任务中,我们在真实数据集上的实验表明,尽管替代模型与目标模型表现出相似的保真度,但在其他关键性能指标上可能显示出显著差异。这些发现对高保真替代模型与实际部署场景中目标模型之间的假定等价性提出了质疑。

English abstract

Model stealing attacks, where adversaries create high-fidelity surrogate models, are a significant threat to the intellectual property of machine learning services. Conventional wisdom suggests these surrogates could provide adversaries with economic leverage comparable to the original service providers. This paper challenges this assumption by evaluating model stealing attacks beyond mere fidelity to the target model. Because query-based extraction provides only partial supervision of the target's input-output behavior, the surrogate is not uniquely identified: many near-optimal surrogates can achieve comparable fidelity while differing in deployment-relevant properties. Instead of performing a classic learning-based model stealing attack, we compute the Rashomon Set (i.e., the set of almost-equally-accurate models) of surrogate models, and evaluate its diversity using multiplicity metrics (ambiguity, discrepancy, and Rashomon Capacity) and group fairness metrics. Across tabular, medical imaging, and NLP tasks, our experiments on real-world datasets reveal that despite exhibiting similar fidelity to the target model, surrogate models can display significant variances in other critical performance metrics. These findings cast doubt on the presumed equivalence between high-fidelity surrogates and the target model in practical deployment scenarios.
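
Two of the multiplicity metrics named in this abstract, ambiguity and discrepancy, can be sketched as follows. The definitions used here follow the common predictive-multiplicity formulation (disagreement of near-equally-good models with a baseline); whether the paper uses exactly these definitions, and which model it takes as the baseline, is an assumption. The predictions are illustrative.

```python
def ambiguity(baseline, competitors):
    """Fraction of examples on which at least one competing model disagrees with the baseline."""
    n = len(baseline)
    flipped = sum(
        any(preds[i] != baseline[i] for preds in competitors) for i in range(n)
    )
    return flipped / n


def discrepancy(baseline, competitors):
    """Largest fraction of examples on which any single competitor disagrees with the baseline."""
    n = len(baseline)
    return max(
        sum(p != b for p, b in zip(preds, baseline)) / n for preds in competitors
    )


# Illustrative labels: one baseline surrogate and two others from the same Rashomon set.
baseline_preds = [1, 0, 1, 0]
rashomon_preds = [
    [1, 0, 1, 1],  # disagrees with the baseline on the last example
    [0, 0, 1, 0],  # disagrees with the baseline on the first example
]
amb = ambiguity(baseline_preds, rashomon_preds)
disc = discrepancy(baseline_preds, rashomon_preds)
```

In this toy case each competitor differs from the baseline on a quarter of the examples, yet half of the examples receive conflicting predictions from some model in the set, which is the kind of gap between per-model fidelity and set-level diversity the abstract points to.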

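2606.15491 2026-06-16 cs.RO New submission

FD-SLAM: Fast Dense Radar-Inertial SLAM with Frequency-Domain Loop Closure and Pose Graph Optimization

FD-SLAM: 基于频域闭环和位姿图优化的快速密集雷达-惯性SLAM

Nader J. Abu-Alrub, Nathir A. Rawashdeh

Affiliations * University of Texas at Austin

AI summary Proposes FD-SLAM, which improves the accuracy and robustness of dense radar-inertial SLAM through frequency-domain loop closure detection and pose graph optimization, and reaches performance competitive with the state of the art on a public dataset.

Details
AI-generated Chinese abstract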

雷达SLAM对于在视觉退化环境中运行的自主地面车辆具有吸引力,然而扫描雷达噪声大、扫描速率低,且其测量值难以在长轨迹上可靠匹配。本文提出FD-SLAM,一种快速密集雷达-惯性SLAM系统,它通过频域闭环检测和位姿图优化扩展了密集雷达-惯性里程计。所提方法通过使用紧凑的频域极坐标描述符进行闭环候选检索,以及基于时间滤波、相位相关筛选、扫描对齐相似性和几何一致性检查的多阶段验证流水线,保留了扫描雷达测量的类图像结构。验证后的闭环作为非顺序约束添加到SE(2)位姿图中,与雷达-惯性里程计因子一起。FD-SLAM在公开数据集上使用标准KITTI评估指标进行评估。结果表明,FD-SLAM改进了FD-RIO基线,与当前最先进的雷达SLAM方法相比具有竞争性能,并在多个评估的驾驶轨迹上提供了良好的旋转精度。运行时分析进一步表明,在仅CPU的设置下,雷达-惯性前端运行速度高于雷达采样率,而闭环检测和图优化适合并行后台执行。

English abstract

Radar SLAM is attractive for autonomous ground vehicles operating in visually degraded environments; however, scanning radars are noisy, have low scanning rates, and produce measurements that are challenging to match reliably over long trajectories. This paper presents FD-SLAM, a fast dense radar-inertial SLAM system that extends dense radar-inertial odometry with frequency-domain loop closure and pose graph optimization. The proposed method preserves an image-like structure of scanning radar measurements by using a compact frequency-domain polar descriptor for loop-candidate retrieval and a multi-stage verification pipeline based on temporal filtering, phase-correlation screening, scan-alignment similarity, and geometric consistency checks. Verified loop closures are added as non-sequential constraints in an SE(2) pose graph together with radar-inertial odometry factors. FD-SLAM is evaluated on a publicly available dataset using standard KITTI evaluation metrics. The results show that FD-SLAM improves on the FD-RIO baseline, achieves competitive performance against current state-of-the-art radar SLAM methods, and provides favorable rotational accuracy across multiple evaluated driving trajectories. Runtime analysis further indicates that the radar-inertial front-end operates above the radar sampling rate on a CPU-only setup, while loop closure detection and graph optimization remain suitable for parallel background execution.

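2606.15486 2026-06-16 cs.CV New submission

ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

ST-DiffEye: 基于扩散的连续注视生成通过联合扫描路径-轨迹建模

Brian Nlong Zhao, Ozgur Kara, Junho Kim, James M. Rehg

Affiliations * University of Illinois Urbana-Champaign

AI summary Proposes ST-DiffEye, a joint trajectory-scanpath diffusion framework that models both representations by concatenating them as additional input channels, and introduces an evaluation method based on the Continuous Ranked Probability Score (CRPS); it achieves state-of-the-art performance on visual search and free-viewing tasks.

Details
AI-generated Chinese abstract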

我们研究人类注视建模问题,旨在生成观察者在观看视觉刺激时产生的注视模式。注视主要通过两种模态捕获:连续眼动轨迹(描述细粒度运动动态)和离散扫描路径(描述高级注视结构)。由于注视在不同观察者和试验间差异显著,我们将这种变异性视为定义属性而非噪声,并将注视建模为随机生成过程。现有的生成式注视模型仅对这两种表示之一进行单独监督。我们假设轨迹和扫描路径以互补尺度描述注视,并在训练过程中联合提供信息,通过ST-DiffEye(一种联合轨迹-扫描路径扩散框架)验证该假设,该框架通过将两者拼接为额外的原始输入通道来耦合两种模态,除了输入和输出通道扩展外无需额外架构开销。我们进一步引入基于连续排序概率得分(CRPS)的原则性评估框架,该框架将任何现有序列相似性度量推广为适当的评分规则,以联合评估生成注视的准确性和多样性。在任务驱动的视觉搜索(涵盖目标存在和目标缺失场景)以及自由观看基准上的实验证明了最先进的性能。这些结果以及详细的消融实验证实了联合建模的优势以及分布感知评估在捕捉人类注视内在变异性方面的价值。项目网页:https://st-diffeye.github.io/

English abstract

We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze. Project webpage: https://st-diffeye.github.io/
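
The CRPS-based evaluation mentioned in this abstract can be approximated from samples with the standard estimator E|X - y| - 0.5 E|X - X'|. The sketch below makes the distance a parameter to suggest how a sequence similarity metric might be plugged in; that generalization, the function name, and the scalar example are assumptions rather than the paper's definition, and whether the result remains a proper scoring rule depends on the chosen distance.

```python
def sample_crps(samples, observed, distance=lambda a, b: abs(a - b)):
    """Sample-based CRPS: mean distance to the observation minus half the mean pairwise distance."""
    m = len(samples)
    # Accuracy term: how far generated samples are from the observed outcome.
    accuracy_term = sum(distance(s, observed) for s in samples) / m
    # Spread term: how far generated samples are from each other (rewards diversity).
    spread_term = sum(distance(a, b) for a in samples for b in samples) / (m * m)
    return accuracy_term - 0.5 * spread_term


# Scalar illustration: a collapsed sample set that hits the observation scores 0,
# while a spread-out set centred on it pays a penalty smaller than its raw error.
collapsed = sample_crps([1.0, 1.0], 1.0)
spread_out = sample_crps([0.0, 2.0], 1.0)
```

Lower is better. The subtraction of the pairwise term is what lets a single score account for both accuracy and diversity of the generated set.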

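2606.15483 2026-06-16 cs.CL New submission

Evaluative Judgement in Teaching AI-based Translation: A Classroom Case Study of AI-Mediated Translation and Post-Editing

基于AI的翻译教学中的评价判断:AI中介翻译与译后编辑的课堂案例研究

Gokhan Dogru

Affiliations * Universitat Pompeu Fabra Barcelona

AI summary Analyzing 23 student projects, the study examines how structured comparison of general-purpose LLMs and online MT systems elicits evaluative judgement in AI-mediated translation, and finds that students do not follow automatic metrics blindly but choose the output to post-edit on grounds such as adequacy and fluency.

Comments Workshop on Teaching AI-based Translation and Technologies (TAITT 2026) - EAMT 2026

Details
AI-generated Chinese abstract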

基于翻译本科课程中第四年机器翻译与译后编辑课程的23个匿名学生项目,本文研究了通用大语言模型和在线机器翻译系统的结构化比较如何引发AI中介翻译中的评价判断。学生将英文维基百科短文本翻译成加泰罗尼亚语或西班牙语,生成四个系统输出,使用自动指标和人工充分性/流畅性评估进行评价,选择一个输出进行译后编辑,并在书面报告中证明其决定。对所有23个项目报告了描述性计数,而定性解释基于22个附有书面报告的案例。结果表明,学生并未将自动指标视为最终权威:最终的译后编辑选择往往与指标排名不同,并通过充分性、流畅性、术语、自然性和预期的译后编辑工作量来证明其合理性。因此,本研究并非在受控条件下对系统进行基准测试;而是分析学生在真实课堂作业中如何证明系统选择的合理性。

English abstract

Drawing on 23 anonymized student projects from a fourth-year Machine Translation and Post-editing course in a BA-level translation programme, this paper examines how structured comparison of general-purpose LLMs and online MT systems can elicit evaluative judgement in AI-mediated translation. Students translated short specialised English Wikipedia texts into Catalan or Spanish, generated four system outputs, evaluated them using automatic metrics and human adequacy/fluency assessment, selected one output for post-editing, and justified their decision in written reports. Descriptive counts are reported for all 23 projects, while qualitative interpretation is based on the 22 cases accompanied by written reports. Results show that students did not treat automatic metrics as final authority: final post-editing selections often diverged from metric rankings and were justified through adequacy, fluency, terminology, naturalness, and expected post-editing effort. The study therefore does not benchmark systems under controlled conditions; it analyses how students justified system choice within an authentic classroom assignment.

2606.15479 2026-06-16 cs.LG cs.AI math.PR New submission

Bayesian 3D Steerable CNNs: Enabling Equivariance and Uncertainty Quantification Simultaneously

贝叶斯3D可转向CNN:同时实现等变性和不确定性量化

Abhishek Keripale, Ponkrshnan Thiagarajan, Susanta Ghosh

Affiliations * Michigan Technological University; Johns Hopkins University; The Center for Artificial Intelligence at the Institute of Computing and Cybersystems, Michigan Technological University

AI summary Proposes a Bayesian Steerable-CNN that makes kernels stochastic through posterior distributions while preserving SE(3)-equivariance, enables uncertainty decomposition, and outperforms the deterministic model in classification accuracy and in robustness under distribution shift.

Details
AI-generated Chinese abstract

可转向卷积神经网络(Steerable-CNNs)通过将核参数化为可转向基函数的线性组合来保证SE(3)-等变性,但其确定性本质阻碍了不确定性量化——限制了其在需要置信度估计的场景中的应用。我们提出一种贝叶斯可转向CNN,将后验分布置于基系数上,从而在精确保持等变性的同时产生随机核。模型的损失函数通过变分推断获得,并通过贝叶斯反向传播最小化。该框架将预测不确定性分解为认知不确定性和偶然不确定性。实验上,该模型在取得竞争性分类精度的同时,预期校准误差为0.0263,并且在加性高斯噪声引起的分布偏移下,其性能比确定性对应模型高出最多6.17%。此外,我们利用模型的不确定性估计显著提升其性能,在测试数据集的84%上实现了约4%的准确率提升。认知不确定性与预测误差之间统计显著的负相关性表明,学习到的后验方差具有语义意义。该框架将贝叶斯不确定性量化与等变CNN的归纳偏置统一起来。

English abstract

Steerable convolutional neural networks (Steerable-CNNs) guarantee SE(3)-equivariance by parameterizing kernels as linear combinations of steerable basis functions, but their deterministic nature precludes uncertainty quantification, limiting their use in settings where confidence estimates are essential. We propose a Bayesian Steerable-CNN that places posterior distributions over the basis coefficients, yielding stochastic kernels while preserving equivariance exactly. The loss function of the model is obtained via variational inference and minimized by Bayes-by-Backpropagation. The framework admits a decomposition of predictive uncertainty into epistemic and aleatoric components. Empirically, the model attains competitive classification accuracy alongside an expected calibration error of 0.0263 and outperforms its deterministic counterpart by up to 6.17% under distributional shift induced by additive Gaussian noise. Furthermore, we leverage the model's uncertainty estimates to enhance its performance significantly, achieving approximately 4% higher accuracy across 84% of the test dataset. A statistically significant negative correlation between epistemic uncertainty and prediction error confirms that the learned posterior variance is semantically meaningful. The framework unifies Bayesian uncertainty quantification with the inductive bias of equivariant CNNs.
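
The expected calibration error (ECE) quoted in this abstract is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin. The sketch below assumes equal-width bins and ten bins; the abstract does not state the binning used, and the inputs are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average over confidence bins of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Illustrative predictions: two at 0.9 confidence (both right), two at 0.6 (one right).
toy_ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

In the toy case each occupied bin is off by 0.1, so the ECE is about 0.1; a value of 0.0263 on this scale would mean confidence and accuracy differ by under three percentage points on average.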

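2606.15476 2026-06-16 cs.RO New submission

FARM: Find Anything using Relational Spatial Memory

FARM: 使用关系空间记忆找到任何物体

Siming He, Leo Huang, Adam Lilja, Fabio Hubel, Jonas Frey, Marco Pavone, S. Shankar Sastry, Jitendra Malik, Claire Tomlin

Affiliations * UC Berkeley; Stanford University

AI summary Proposes FARM, which builds in real time an open-vocabulary object-level memory containing geometry, visual-language descriptors, and viewpoint evidence, and uses VLMs to parse queries together with explicit spatial constraints; across 44k language queries it improves Recall@5 and Recall@10 by 164% and 224%, and Accuracy@1 by 35%.

Details
AI-generated Chinese abstract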

在家庭、仓库及其他物体丰富的环境中运行的机器人需要能够按需找到特定物体实例的记忆系统。仅靠物体级记忆往往不够:场景中包含许多看似匹配的物体,用户通过目标与地标及周围物体的关系来指代目标(例如,“飞镖盘下方、海报左侧的高灯”),这要求一种支持通过语义、外观和空间谓词进行检索的关系空间记忆。为此,我们提出了FARM(使用关系空间记忆找到任何物体),该系统以5-10 Hz的实时速度构建一个紧凑的、开放词汇的物体级记忆,包含几何、视觉语言描述和视角证据。在查询时,FARM使用VLM解析查询并评分视觉证据,同时通过物体符号和关系谓词显式地约束空间关系。这种对VLM的结构化使用使得检索比基于帧历史或场景图上下文的端到端推理更准确和鲁棒。在涵盖67个室内外场景(面积从15到15,000平方米)的44k语言查询实验中,FARM的Recall@5和Recall@10相比先前方法分别提升了164%和224%,最终VLM重排序阶段将Accuracy@1提升了35%,同时保持实时运行。我们进一步在四足机器人上使用机载传感器和计算展示了闭环部署。

English abstract

Robots operating in homes, warehouses, and other object-rich environments need memory systems that can find specific object instances on demand. Object-level memory alone is often insufficient: scenes contain many plausibly matching objects, and users refer to the target through relations to landmarks and surrounding objects (e.g., "the tall lamp below the dartboard and to the left of the poster"), demanding a relational spatial memory that supports retrieval through semantic, appearance, and spatial predicates over objects. To achieve this, we present FARM (Find Anything using Relational Spatial Memory), which builds, in real time at 5-10 Hz, a compact, open-vocabulary, object-level memory with geometry, visual-language descriptors, and viewpoint evidence. At query time, FARM uses VLMs to parse the query and score visual evidence, while grounding spatial constraints explicitly through object symbols and relational predicates. This structured use of VLMs enables more accurate and robust retrieval than end-to-end reasoning over frame histories or scene-graph context. In experiments on 44k language queries spanning 67 indoor and outdoor scenes, ranging from 15 to 15,000 m^2, FARM improves Recall@5 and Recall@10 over prior methods by 164% and 224%, and a final VLM reranking stage improves Accuracy@1 by 35%, while running in real time. We further demonstrate closed-loop deployment on a quadrupedal robot using onboard sensors and compute.
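
For readers unfamiliar with the retrieval metrics in this abstract, the sketch below shows one common way Recall@k is computed for single-target queries, plus a relative-improvement helper. Reading the reported percentages as relative gains is an assumption; all object identifiers and numbers here are hypothetical.

```python
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the target object appears among the top-k retrieved candidates, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(queries, k):
    """Average Recall@k over (ranked_ids, target_id) pairs."""
    return sum(recall_at_k(ranked, target, k) for ranked, target in queries) / len(queries)


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Hypothetical retrieval results for three language queries.
toy_queries = [
    (["lamp_2", "lamp_1", "poster"], "lamp_1"),  # target ranked second
    (["chair", "table", "sofa"], "sofa"),        # target ranked third
    (["mug", "cup"], "plate"),                   # target never retrieved
]
r_at_2 = mean_recall_at_k(toy_queries, 2)
r_at_3 = mean_recall_at_k(toy_queries, 3)
```

Accuracy@1 is the special case k = 1, which is why a reranking stage that reorders the top candidates can raise it without changing Recall@5 or Recall@10.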

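2606.15474 2026-06-16 cs.AI stat.AP New submission

Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

谁漂移了:系统还是裁判?LLM评估流水线中的随时有效归因

Yitao Li

Affiliations * University of California, Berkeley

AI summary Proposes a method based on a fixed anchor set and betting tests that distinguishes score drift caused by product degradation from drift caused by a changed judge model in LLM evaluation, and proves its anytime-validity and attribution accuracy.

Details
AI-generated Chinese abstract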

对LLM产品的持续评估依赖于一个被视为地面真相的强大LLM裁判:一个廉价的监控器对每次交互进行评分,当分数下降时团队会收到警报。但裁判本身是一个API背后的模型,静默的版本升级或评分提示更新会改变其评分方式——因此每次漂移警报在更差的产品和变化的裁判之间是模糊的。我们通过一个固定的人工标注锚点集(当前裁判以稳定间隔重新评分)、一个关于裁判与人类差距的二次赌e过程,以及一个返回{无, 系统, 裁判}判决的守卫窗口规则来解决这种模糊性。我们证明了随时有效性、单向识别(只有裁判可以移动锚点)、一个归因竞赛(其设计法则是锚点必须跑赢它们守卫的主过程)以及过程正交性。在两个真实的裁判变化中,静默版本升级在60/60次运行中被检测为裁判漂移,且零次误归因为系统;而一个污染性的严格提示变化在守卫宽度为300时,120次运行中有110次被正确归因——而行业默认的滚动z检验在75%的无漂移流上产生误报。每个实验在第二个领域(TL;DR摘要)上重复,无需重新调整参数,并且当领域不同时,差异正是竞赛所预测的:严格提示变化在那里更强烈地改变分数,因此锚点触发更快,归因变得完美(240/240)。该监控器的运行成本约为对每个项目使用强裁判的0.64倍,或在更便宜但更聋的模式下为0.21倍。

English abstract

Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and a silent version bump or scoring-prompt update changes how it scores -- so every drift alarm is ambiguous between a worse product and a changed judge. We resolve the ambiguity with a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave, a second betting e-process on the judge-versus-human gap, and a guard-window rule returning a verdict in {none, system, judge}. We prove anytime-validity, one-way identification (only the judge can move the anchors), an attribution race whose design law is that the anchors must out-run the main process they guard, and process orthogonality. On two real judge changes, a silent version bump is detected as judge drift in 60/60 runs with zero judge-to-system misattribution, and a contaminating strict-prompt change is correctly attributed on 110 of 120 runs at guard width 300 -- while the industry-default rolling z-test false-alarms on 75% of drift-free streams. Every experiment replicates on a second domain (TL;DR summarization) with nothing re-tuned, and where the domains differ the differences are the ones the race predicts: the strict-prompt change shifts scores harder there, so the anchors fire faster and attribution becomes perfect (240/240). The monitor runs at approximately 0.64 of the cost of strong-judging every item, or 0.21 in a cheaper-but-deafer regime.
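
The "betting e-process" in this abstract belongs to a family of sequential tests that can be sketched generically. The code below is a textbook-style test martingale for bounded observations with a fixed bet, not the paper's construction: the null hypothesis, the fixed bet size, and the idea of feeding it a normalized judge-versus-human gap per anchor item are all assumptions made for illustration.

```python
def betting_e_process(observations, null_mean, bet, alpha=0.05):
    """Wealth process for H0: E[x_t | past] <= null_mean, with x_t in [0, 1].

    Returns the wealth path and the 1-based index of the first rejection (or None).
    """
    # A bet in [0, 1 / null_mean] keeps every factor non-negative, so under H0 the
    # wealth is a non-negative supermartingale starting at 1.
    if not 0.0 <= bet <= 1.0 / null_mean:
        raise ValueError("bet must lie in [0, 1 / null_mean]")
    wealth = 1.0
    path = []
    rejected_at = None
    for t, x in enumerate(observations, start=1):
        wealth *= 1.0 + bet * (x - null_mean)
        path.append(wealth)
        # Ville's inequality: under H0, wealth ever reaching 1 / alpha has probability <= alpha.
        if rejected_at is None and wealth >= 1.0 / alpha:
            rejected_at = t
    return path, rejected_at


# Illustrative streams: gaps pinned at the null mean never trigger; large gaps do.
quiet_path, quiet_hit = betting_e_process([0.5] * 50, null_mean=0.5, bet=1.0)
drift_path, drift_hit = betting_e_process([1.0] * 10, null_mean=0.5, bet=1.0)
```

Because the rejection rule holds at every time step simultaneously, the stream can be checked after each new observation without inflating the false-alarm rate, which is the sense of "anytime-valid" in the title.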

2606.15469 2026-06-16 cs.RO New submission

Learning Context-Aware Neural ODE Dynamics for Adaptive Robotic Control

Shao-Yi Yu, Jen-Wei Wang, Maya Horii, Masayoshi Tomizuka, Vikas Garg

Affiliations * University of California, Berkeley; Aalto University; YaiYai Ltd

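AI summary Proposes a context-aware dynamics model based on neural ODEs that infers environmental factors from state-action histories through two-phase training, enabling adaptation under model predictive control; its effectiveness in temporally and spatially varying environments is validated on a quadrotor, a Sphero BOLT, and a Fanuc manipulator.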

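Details
AI-generated abstract (translated from Chinese)

Robotic systems deployed in uncertain, dynamically changing environments often face variations in contact conditions, aerodynamic effects, and external disturbances, all of which challenge reliable control. To stay effective under model-based control, these systems need dynamics models that can adapt to such variations, particularly when direct access to complete environmental information is limited. To achieve adaptability and ease integration with model predictive control, we propose a context-aware dynamics model based on neural ordinary differential equations that uses a two-phase training procedure to infer environmental factors from state-action histories. We validate the method on several robotic platforms: a quadrotor in simulation, and a Sphero BOLT robot and a Fanuc manipulator in real-world experiments. The results show that the method adapts effectively to temporally and spatially varying environmental changes across tasks. Videos are available at https://youtu.be/PY0sNyF2rqE and the source code at https://github.com/syyu410-yu/context-aware-neural-ode-control.git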

English abstract

Robotic systems deployed in uncertain and dynamically changing environments often face variations in contact conditions, aerodynamic effects, and external disturbances that challenge reliable control. To remain effective under model-based control, these systems require dynamics models that can adapt to such changes, especially when direct access to complete environmental information is limited. To enable adaptability and facilitate integration with model predictive control, we propose a context-aware dynamics model based on neural ordinary differential equations, which infers environmental factors from state-action histories using a two-phase training procedure. We validate the approach across diverse robotic platforms, including a quadrotor in simulation, as well as a Sphero BOLT robot and a Fanuc manipulator in real-world experiments. The results demonstrate that our method effectively adapts to temporally and spatially varying environmental changes across different tasks. Videos are available at https://youtu.be/PY0sNyF2rqE , and the source code is available at https://github.com/syyu410-yu/context-aware-neural-ode-control.git .

2606.15468 2026-06-16 cs.CV cs.LG New submission

Analyzing Visual Aircraft Representations with Sparse Autoencoders

Deepshik Sharma

Affiliations * Jain University

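AI summary This paper uses a sparse autoencoder to decompose the intermediate representations of a ConvNeXt model trained on the FGVC-Aircraft dataset, finds interpretable aircraft-structure features, and verifies their class relevance through ablation experiments.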

Comments 18 pages, 4 figures, 7 tables

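Details
AI-generated abstract (translated from Chinese)

Vision models can achieve strong performance on classification tasks, but the internal representations behind their predictions are often hard to interpret. This paper studies whether sparse autoencoders can decompose a vision model's intermediate representations into interpretable features. We train a ConvNeXt classifier on the FGVC-Aircraft dataset, extract spatial activations from its final feature stage, and train a sparse autoencoder on those activations. The learned sparse features are analyzed using top-activating image patches, activation strength, and class selectivity. Qualitative visual inspection shows that several features correspond to recognizable aircraft structures and visual patterns. We evaluate a selected subset of features with input-space and feature-space ablations, measuring how blurring image patches and suppressing sparse features affect class logits, classification margins, and prediction confidence. The results suggest that sparse autoencoders can reveal partially interpretable, class-relevant visual features associated with aircraft recognition, while also exposing limitations such as polysemanticity and coarse spatial localization.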

English abstract

Vision models can achieve strong performance on classification tasks, but the internal representations supporting their predictions are often difficult to interpret. This work investigates whether sparse autoencoders can decompose intermediate representations of a vision model into interpretable features. We train a ConvNeXt classifier on the FGVC-Aircraft dataset, extract spatial activations from its final feature stage, and train a sparse autoencoder on these activations. The learned sparse features are analyzed using top-activating image patches, activation strength, and class selectivity. Qualitative visual inspection reveals that several features correspond to recognizable aircraft structures and visual patterns. We evaluate a subset of selected features using input-space and feature-space ablations, measuring how blurring image patches and suppressing sparse features affect class logits, classification margins, and prediction confidence. The results suggest that sparse autoencoders can reveal partially interpretable, class-relevant visual features associated with aircraft recognition, while also exposing limitations such as polysemanticity and coarse spatial localization.
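
As a rough illustration of what a feature-space ablation measures, the sketch below runs a toy sparse autoencoder in pure Python. Everything in it is an assumption made for illustration: the weights `W_ENC`, `B_ENC`, `W_DEC`, the linear classifier head `W_CLS`, and the function names are hypothetical and unrelated to the paper's ConvNeXt model or its trained autoencoder. It encodes one activation vector, zeroes a single sparse feature, decodes, and reports how the class logits change.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Sparse code: ReLU(W_enc x + b_enc)."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(z, w_dec):
    """Reconstruct the activation from the sparse code."""
    return matvec(w_dec, z)


def ablation_logit_delta(x, feature_idx, w_enc, b_enc, w_dec, w_cls):
    """Change in class logits when one sparse feature is zeroed."""
    z = sae_encode(x, w_enc, b_enc)
    z_ablated = [0.0 if i == feature_idx else a for i, a in enumerate(z)]
    logits = matvec(w_cls, sae_decode(z, w_dec))
    logits_ablated = matvec(w_cls, sae_decode(z_ablated, w_dec))
    return [after - before for before, after in zip(logits, logits_ablated)]


# Hypothetical toy weights: 2-dim activation, 3 sparse features, 2 classes.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -1.5]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
W_CLS = [[1.0, -1.0], [-1.0, 1.0]]

delta = ablation_logit_delta([2.0, 0.5], 0, W_ENC, B_ENC, W_DEC, W_CLS)
print(delta)  # suppressing feature 0 lowers class 0 and raises class 1
```

In this toy setup the logits are computed from the reconstruction, so the delta isolates the contribution of the ablated feature; a real pipeline would splice the edited reconstruction back into the network at the layer the autoencoder was trained on.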

2606.15461 2026-06-16 cs.CL cs.AR New submission

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

Pierre Dantas, Lucas Cordeiro, Waldir Junior

Affiliations * The University of Manchester; Federal University of Amazonas (UFAM)

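AI summary Proposes ESBMC-PLC, the first open-source formal verification tool with native ladder diagram (LD) support, which verifies safety properties via SMT-based bounded model checking and k-induction; it correctly classifies 61 properties across 13 benchmarks and finds 8 bugs.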

Comments 24 pages

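Details
AI-generated abstract (translated from Chinese)

PLCs execute safety-critical programs across industrial sectors. Ladder diagram (LD) under the IEC 61131-3 standard, the dominant PLC notation, still lacks formal verification: SMT-based model checkers cannot process LD's rung-and-coil graphics. This paper presents ESBMC-PLC, the first open-source formal verifier with native LD support (PLCopen XML format), implemented as a new ESBMC frontend. ESBMC-PLC translates LD rungs to GOTO IR, models the PLC scan cycle as a while(true) loop with nondeterministic inputs, and checks safety properties via SMT-based bounded model checking or k-induction. A YAML language with five properties (mutual exclusion, invariant, absence, response, reachability) avoids temporal logic. A survey of 22 studies (2020-2026) identifies four research gaps, of which ESBMC-PLC closes two. Evaluation on 13 benchmarks (6 domains, 3 sources, including deployed CONTROLLINO PLCs and MathWorks Simulink PLC Coder) shows correct classification across 61 properties: all 9 author-constructed programs (Categories A/B) behave as expected, all 4 vendor programs (Category C) are correctly unlabeled, 8 bugs are found (with actionable counterexamples), 7 unbounded k-induction proofs are obtained, and all runs finish in under 60 ms on Apple Silicon. A feature comparison with PLCverif shows that ESBMC-PLC is the only open-source tool that combines native LD, k-induction, and SMT bit-vector semantics.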

English abstract

PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil graphics. This paper presents ESBMC-PLC, the first open-source formal verifier with native LD support (PLCopen XML format), implemented as a new ESBMC frontend. ESBMC-PLC translates LD rungs to GOTO IR, models the PLC scan cycle as a while(true) loop with nondeterministic inputs, and checks safety properties via SMT-based bounded model checking or k-induction. A five-property YAML language (mutual_exclusion, invariant, absence, response, reachability) avoids temporal logic. A survey of 22 studies (2020-2026) identifies four research gaps; ESBMC-PLC closes two of them. Evaluation on 13 benchmarks (6 domains, 3 sources - including deployed CONTROLLINO PLCs and MathWorks Simulink PLC Coder) shows correct classification across 61 properties: all 9 author-constructed programs (Categories A/B) as expected, all 4 vendor programs (Category C) correctly unlabeled, with 8 bugs found (actionable counterexamples), 7 unbounded k-induction proofs, all runs under 60ms on Apple Silicon. Feature comparison with PLCverif shows that ESBMC-PLC is the only open-source tool that combines native LD, k-induction, and SMT bit-vector semantics.
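
To make the scan-cycle idea concrete, here is a small sketch that treats one PLC scan as a function from (state, inputs) to a new state and explores every input combination for a bounded number of scans. It is not ESBMC-PLC and uses no SMT solver: it swaps in brute-force enumeration, and the two-rung forward/reverse motor program, the function names, and the bound are all hypothetical. It checks a mutual-exclusion property (the forward and reverse coils are never energized together) and returns an input trace as a counterexample when the property fails.

```python
from itertools import product


def scan_interlocked(state, inputs):
    """One scan of a two-rung program with an interlock between coils."""
    fwd, rev = state
    start_fwd, start_rev, stop = inputs
    # Rung 1: forward latches on start_fwd, is dropped by stop or by reverse.
    new_fwd = (start_fwd or fwd) and not stop and not rev
    # Rung 2: evaluated after rung 1, so it sees the updated forward coil.
    new_rev = (start_rev or rev) and not stop and not new_fwd
    return (new_fwd, new_rev)


def scan_buggy(state, inputs):
    """Same program with the interlock contacts left out."""
    fwd, rev = state
    start_fwd, start_rev, stop = inputs
    new_fwd = (start_fwd or fwd) and not stop
    new_rev = (start_rev or rev) and not stop
    return (new_fwd, new_rev)


def mutual_exclusion(state):
    """Safety property: forward and reverse are never both energized."""
    return not (state[0] and state[1])


def bounded_check(scan, holds, init, n_inputs, bound):
    """Explore all input sequences up to `bound` scans, breadth first.

    Returns None if the property holds within the bound, otherwise the
    list of input tuples that leads to a violating state.
    """
    if not holds(init):
        return []
    frontier = {init: []}  # newly reached state -> input trace reaching it
    seen = {init}
    for _ in range(bound):
        next_frontier = {}
        for state, trace in frontier.items():
            # Inputs are nondeterministic: try every combination each scan.
            for inputs in product((False, True), repeat=n_inputs):
                new_state = scan(state, inputs)
                new_trace = trace + [inputs]
                if not holds(new_state):
                    return new_trace
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier[new_state] = new_trace
        frontier = next_frontier
    return None


print(bounded_check(scan_interlocked, mutual_exclusion, (False, False), 3, 5))
print(bounded_check(scan_buggy, mutual_exclusion, (False, False), 3, 5))
```

The second call reports a one-scan counterexample in which both start buttons are pressed without stop. Enumeration like this only scales to a handful of Boolean inputs, which is one reason a symbolic encoding is attractive for real programs; the k-induction step mentioned in the abstract is not sketched here.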