arXivDaily: arXiv daily academic digest, updated Monday through Friday
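2606.08292 2026-06-09 cs.AI New submission

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

Philip Quirke

Affiliations: Martian

AI summary: The paper finds that an attention head passing three tests (necessity, linear encoding, and recovery when restored after ablation) is still not enough to establish its role; it introduces the KID framework and activation transduction under matched controls to expose the gaps in such role claims.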

Comments 9 pages, 1 figure

Abstract

In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched into a different prompt under matched controls. We introduce KID (Knowing / Intent / Doing), a role-assignment lens for attention heads, and pair it with a three-stage pipeline: capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls. Our results document a preliminary role taxonomy (including prompt-trajectory stabilizers, answer-side logit-bias heads, and soft computation-pattern carriers) and show that the same-answer control (a transduction target sharing the answer string but not the requested computation) is an underused check that exposes broad state transfer masquerading as semantic specificity.

2606.08291 2026-06-09 cs.LG New submission

On solving symmetric multi-type orthogonal non-negative matrix tri-factorization problem

Rok Hribar, Gregor Papa, Janez Povh, Andrej Kastrin

Affiliations: Laboratory for Engineering Design, Faculty of Mechanical Engineering, University of Ljubljana; Rudolfovo – Science and Technology Centre Novo mesto; Institute of Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana

AI summary: Studies the symmetric multi-type orthogonal non-negative matrix tri-factorization problem and proposes a fixed point method based on the KKT conditions and a three-stage ADAM-based algorithm; experiments on synthetic data and citation networks confirm the factorization quality and competitiveness in tasks such as clustering and link prediction.

Comments 27 pages, 9 tables, 3 figures

Abstract

We study the symmetric multi-type orthogonal non-negative matrix tri-factorization problem, where several symmetric non-negative matrices are simultaneously approximated by factors of the form $GS_{i}G^{\top}$, with a shared non-negative and orthogonal factor $G$. This model is motivated by clustering and network analysis, where non-negativity improves interpretability and orthogonality gives a natural assignment-type structure to the latent factor. Since the resulting optimization problem is highly non-convex, we develop two heuristic algorithms for computing high-quality local solutions. The first one is a fixed point method derived from the Karush-Kuhn-Tucker conditions after adding a penalty term for the orthogonality constraint. The second one is a three-stage ADAM-based method that combines non-negativity-preserving optimization, orthogonalization, and restricted ADAM refinement on the feasible set. We evaluate both methods on synthetic data, including noisy instances, and on citation network benchmarks. The synthetic experiments show that both algorithms recover factorizations close to the optimum and remain stable under noise. On real networks, the learned embeddings are competitive with or better than standard baselines such as SVD, node2vec, and classical link prediction heuristics in link prediction, node clustering, and node classification tasks.
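As a sketch of the model this abstract describes, the fitting problem can be written as below. The data matrices $R_i$, their count $m$, the Frobenius-norm loss, and the penalty weight $\lambda$ are assumptions made for illustration; the abstract itself only fixes the form $GS_{i}G^{\top}$ and the shared non-negative, orthogonal factor $G$.

```latex
% Constrained form (assumed loss): fit all R_i with one shared factor G
\min_{G \ge 0,\; S_1,\dots,S_m} \; \sum_{i=1}^{m} \left\| R_i - G S_i G^{\top} \right\|_F^2
\quad \text{subject to} \quad G^{\top} G = I .

% Penalized form (assumed): the orthogonality constraint is moved into the objective
\min_{G \ge 0,\; S_1,\dots,S_m} \; \sum_{i=1}^{m} \left\| R_i - G S_i G^{\top} \right\|_F^2
+ \lambda \left\| G^{\top} G - I \right\|_F^2 .
```

When $G$ is non-negative and $G^{\top}G = I$, its columns have disjoint supports, so each row of $G$ has at most one nonzero entry. That is the assignment-type structure the abstract refers to.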

2606.08288 2026-06-09 cs.RO New submission

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

Shanglin Yuan, Weiheng Zhao, Xianda Guo, Wei Sui, Li Yu, Wenyu Liu, Xinggang Wang

Affiliations: Huazhong University of Science and Technology; D-Robotics; Wuhan University

AI summary: Proposes MotionVLA, whose motion-history interface converts a past video window into compact, continuous trajectory-field tokens, addressing geometric drift and fragmented temporal cues in long-horizon manipulation and improving action smoothness and execution efficiency.

Comments 17 pages, 8 figures

Abstract

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of independently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.

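2606.08287 2026-06-09 cs.LG cond-mat.mtrl-sci cs.CE New submission

Mesh Graph Neural Network Framework for Accelerating Finite Element Simulation for Arbitrary Geometries

Josiah D. Kunz, Kamal Choudhary

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes a mesh graph network (MGN) that predicts von Mises stress fields in 2D structures with arbitrary hole geometries; encoding node types, relative edge features, and global features makes it translation- and rotation-invariant, and it reaches R² ≥ 0.97 in the most favorable unseen-geometry, unseen-load case, outperforming conventional models.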

Comments 10 pages, 6 figures, to be published. Code available at https://github.com/Josiah-Kunz/MGN-Public

Abstract

Finite element analysis (FEA) is essential for structural design but remains computationally expensive, particularly when evaluating multiple design iterations or load scenarios. Machine learning surrogate models offer a promising alternative, yet most approaches struggle with a critical limitation: generalizing across varying geometries. This work presents a mesh graph network (MGN) for predicting von Mises stress fields in 2D structural components with arbitrary hole geometries. Unlike traditional machine learning approaches that use absolute node coordinates as features, the proposed model builds on existing MGN frameworks that encode node types (e.g., fixed boundary, free surface, hole edge), relative edge features (distance between neighbors), and global features (applied load). This architecture is inherently translation- and rotation-invariant, enabling generalization to unseen geometries without retraining. The MGN was trained on 11 plate geometries under 20 load conditions and evaluated on 7 unseen geometries and 3 unseen loads. In the most favorable case, the model achieves $R^2 \geq 0.97$ on an unseen geometry and unseen load, compared to $R^2 \approx 0.01$--$0.86$ for conventional models (Random Forest, Gradient Boosting, K-Nearest Neighbors) trained on identical data. Even in less favorable cases, the MGN model still outperforms conventional models. This work extends the mesh-based simulation framework of Pfaff et al. (arXiv:2010.03409) to structural mechanics, demonstrating that graph neural networks can serve as efficient surrogates for finite element analysis across varying geometries.
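The invariance claim in this abstract can be checked on a toy example. The sketch below is not the authors' code: the four-node mesh patch, its connectivity, and the helper names `edge_lengths` and `rigid_transform` are hypothetical. It only shows that distances between neighboring nodes, unlike absolute coordinates, do not change when the geometry is rotated and translated.

```python
import math

def edge_lengths(nodes, edges):
    """Relative edge feature: Euclidean distance between connected nodes."""
    return [math.dist(nodes[i], nodes[j]) for i, j in edges]

def rigid_transform(nodes, angle, shift):
    """Rotate every node by `angle` (radians) about the origin, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in nodes]

# Hypothetical 4-node mesh patch (coordinates and connectivity are made up).
nodes = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 1.5)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

# Move the whole patch: absolute coordinates change, edge lengths should not.
moved = rigid_transform(nodes, angle=0.7, shift=(5.0, -3.0))
before = edge_lengths(nodes, edges)
after = edge_lengths(moved, edges)
max_diff = max(abs(a - b) for a, b in zip(before, after))
print("largest change in any edge length:", max_diff)
```

A model fed `before` and one fed `after` see the same edge features up to floating-point rounding, whereas a model fed `nodes` and `moved` directly would see two different inputs for the same part.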

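2606.08286 2026-06-09 cs.SD New submission

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

Annie Chu, Jason Brent Smith, Bryan Pardo

Affiliations: Northwestern University

AI summary: Proposes the FXplorer interface, which organizes audio effects in a perceptually informed 2D space, unifies continuous browsing and parameter refinement through spatial interaction and embedding methods, and supports interactive preset editing and interpolation.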

Comments Accepted to NIME 2026. Project page: https://anniejchu.github.io/fxplorer/

Abstract

Audio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.

2606.08284 2026-06-09 cs.CV cs.RO New submission

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng, Rong Xiong, Yue Wang, Yanmei Jiao

Affiliations: State Key Laboratory of Industrial Control and Technology, Zhejiang University; Zhejiang Humanoid Robot Innovation Center Co., Ltd.; School of Information Science and Engineering, Hangzhou Normal University

AI summary: Proposes G2G, which freezes a multi-view foundation model and adds three lightweight trainable modules (a perceiver resampler, a cross-group bridge, and a multi-frame pose head) to estimate inter-group 6-DoF pose using only relative-pose supervision, reaching state-of-the-art results on four datasets.

Abstract

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.

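2606.08282 2026-06-09 cs.AI New submission

From Validator Selection to Portfolio Collection Optimization in Proof-of-Stake Blockchains

Jonas Gehrlein, Grzegorz Miebs, Matteo Brunelli, Adam Mielniczuk, Miłosz Kadziński

Affiliations: Parity Technologies AG; Institute of Computing Science, Poznan University of Technology; Department of Industrial Engineering, University of Trento

AI summary: For the multi-criteria problem of nominators selecting validators in proof-of-stake blockchains, proposes a bi-objective optimization framework that simultaneously maximizes the expected utility of validators (portfolio quality and profitability) and the expected entropy of the allocation (risk diversification); it is solved with active preference learning and a multi-objective evolutionary algorithm, and an interactive binary search navigation identifies a satisfactory trade-off.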

Comments 24 pages, 5 figures, 3 tables

Abstract

We consider a problem arising in proof-of-stake blockchain environments, where agents called nominators select validators, the entities responsible for maintaining the blockchain's physical infrastructure. The selection process is inherently subjective and multi-criteria, and it is compounded by the fact that nominators commonly operate through multiple accounts. This gives rise to a portfolio selection problem, where agents seek to distribute their nominations across accounts to diversify risk. We propose a decision support framework to optimize this selection by simultaneously maximizing two objectives: the expected utility of the validators likely to be allocated, representing portfolio quality and profitability, and the expected entropy of the allocation, representing diversification and risk mitigation across stashes. Validator utilities are derived using an original active preference learning procedure based on multi-attribute value theory, with emphasis on top-ranked validators. The resulting bi-objective optimization problem is solved with a multi-objective evolutionary algorithm and, to support the final choice, we introduce an interactive binary search navigation procedure that guides the nominator through the front and identifies a satisfactory trade-off with only a few questions. Numerical experiments examine the optimization strategies, while an expert assessment involving five experienced nominators confirms the approach's practical relevance and usefulness.

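2606.08278 2026-06-09 cs.RO New submission

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

Songlin Wei, Zhenhao Ni, Jie Liu, Zhenyu Zhao, Junjie Ye, Hongyi Jing, Junkai Xia, Xiawei Liu, Michael Leong, Liang Heng, Di Huang, Yue Wang

Affiliations: USC Physical Superintelligence (PSI) Lab

AI summary: Proposes the SIMPLE simulation platform, which combines MuJoCo dynamics with IsaacSim rendering and includes 60 whole-body tasks, 50 indoor scenes, and over 1,000 object assets; it supports automated trajectory generation and VR teleoperation data collection and integrates several mainstream policies. Experiments show a strong correlation between simulated and real-world performance, as well as zero-shot transfer.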

Abstract

Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.

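2606.08277 2026-06-09 cs.CV New submission

Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees

Harry Zhang, Nicolas Gorlo, Luca Carlone

Affiliations: MIT

AI summary: To address noisy, viewpoint-inconsistent VLM captions in long-horizon robot operation, proposes an object-level semantic uncertainty score and integrates it into the UQ-DAAAM system, which reduces uncertainty by actively selecting high-quality views and fusing multi-view captions, with probabilistic guarantees.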

Abstract

Long-horizon robot operation requires spatio-temporal memory to record the environment state and recall it for downstream reasoning. Scene graphs and retrieval-augmented systems ground VLM descriptions to persistent 3D entities with rich semantic descriptions. However, VLM captions are noisy and viewpoint-inconsistent, and existing systems treat them as an oracle with no mechanism to detect unreliable stored descriptions. We introduce object-level semantic uncertainty for multi-view VLM memory: a score that measures object-centric cross-view semantic scatter of captions and identifies semantically unresolved objects. Then, we include our uncertainty scores in an advanced spatial-semantic memory system, which we dub UQ-DAAAM. UQ-DAAAM uses this score to actively refine uncertain objects under a fixed query budget by selecting high-quality views and fusing the resulting multi-view captions into a single object description. We also derive probabilistic guarantees showing that higher-quality candidate views (as selected by our approach) are more likely to reduce uncertainty. Our experiments show that uncertainty quantification can make embodied 4D memory systems more reliable and more effective. In particular, on the OC-NaVQA benchmark, UQ-DAAAM achieves substantially larger uncertainty reduction and better spatio-temporal question answering performance than baselines.

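2606.08275 2026-06-09 cs.LG cs.AI New submission

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Jaineet Shah

Affiliations: Carnegie Mellon University

AI summary: Proposes Causal Agent Replay (CAR), which uses a structural causal model and interventions to attribute LLM-agent failures counterfactually to individual steps, addressing the inability of existing methods to locate the step that made the decision.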

Comments Open-source: https://github.com/jaineet17/causal-agent-replay

Abstract

When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
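The Monte-Carlo Shapley idea mentioned in this abstract can be illustrated with a generic permutation-sampling estimator. This is a minimal sketch, not CAR's implementation or API: `mc_shapley`, `toy_value`, and the three-step toy outcome model are hypothetical, and the toy numbers are not the paper's experiment. The value function stands in for "change in outcome when this set of steps is intervened on", which in a real system would come from re-executing the trajectory.

```python
import random
from typing import Callable, FrozenSet, List

def mc_shapley(n_steps: int,
               value: Callable[[FrozenSet[int]], float],
               n_perms: int = 200,
               seed: int = 0) -> List[float]:
    """Estimate per-step Shapley values by averaging marginal contributions
    over randomly sampled orderings of the steps."""
    rng = random.Random(seed)
    phi = [0.0] * n_steps
    order = list(range(n_steps))
    for _ in range(n_perms):
        rng.shuffle(order)
        coalition = set()
        prev = value(frozenset(coalition))
        for step in order:
            coalition.add(step)
            cur = value(frozenset(coalition))
            phi[step] += cur - prev  # marginal contribution of `step` in this ordering
            prev = cur
    return [p / n_perms for p in phi]

def toy_value(intervened: FrozenSet[int]) -> float:
    """Hypothetical outcome shift: the failure is averted only when steps 0 and 1
    are both intervened on; step 2 is irrelevant."""
    return 1.0 if {0, 1} <= intervened else 0.0

phi = mc_shapley(3, toy_value, n_perms=2000)
print([round(p, 3) for p in phi], "sum =", round(sum(phi), 3))
```

In this toy model the credit for the two interacting steps splits roughly evenly and the irrelevant step gets none. The number of permutations plays the role of a budget: each sampled ordering costs one value-function evaluation per step, so capping `n_perms` caps the number of replays.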

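2606.08272 2026-06-09 cs.CL cs.AI New submission

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

Mohsina Bilal, Gopakumar G

Affiliations: National Institute of Technology Calicut

AI summary: Presents AgriGov, a trilingual dataset built through automated scraping, a translation pipeline, and human post-editing, yielding about 8,000 sentence-aligned parallel pairs in the agricultural policy domain and supporting applications such as machine translation and question answering.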

Comments 15 pages, 4 figures, Submitted to: Sadhana, Elsevier

Abstract

AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. Initially, we collected and structured data from 50 government schemes sourced from trusted portals using automated scraping techniques, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were performed using a pipeline combining Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2,100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers robust resources for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.

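2606.08262 2026-06-09 cs.LG New submission

Causal Semantic Alignment for LLM-based Time Series Forecasting

Kexuan Zhang, Xiaobei Zou, Cesare Alippi, Gary G. Yen, Yang Tang

Affiliations: University of Science and Technology of China; University of California, Berkeley

AI summary: Proposes the CVAformer framework, which disentangles the dynamic and invariant components of each variable and applies causal intervention to remove confounding bias in alignment, matching or exceeding state-of-the-art performance across several forecasting settings.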

Abstract

Recent advances in Large Language Models (LLMs) have opened new possibilities for time series forecasting by enabling alignment between temporal patterns and pretrained word embeddings. However, most LLM-based methods overlook the heterogeneous nature of time series, where dynamic fluctuations and invariant semantics are entangled. This entanglement introduces spurious correlations during the alignment, as dynamic components act as confounders by simultaneously influencing invariant components and the resulting aligned embeddings. To address this issue, a variable-level alignment framework CVAformer is proposed. CVAformer explicitly disentangles each variable into invariant and dynamic components just before alignment, and applies causal intervention to mitigate the confounding effect of the dynamics. To better support variable-level alignment, CVAformer replaces the standard causal attention in LLMs with a non-causal attention mechanism that captures interactions among variables at each time step. Extensive experiments across long-term, short-term, few-shot, and zero-shot forecasting settings indicate that CVAformer matches or exceeds state-of-the-art performance on most datasets, and in some cases achieves notably better accuracy. Experimental results validate the effectiveness of variable-level alignment and dynamic disentanglement in CVAformer, offering a new perspective for LLM-based time series tasks.
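The difference between the standard causal attention and the non-causal attention mentioned in this abstract comes down to the attention mask. The sketch below is a generic illustration, not CVAformer's code: `attention_weights` is a hypothetical helper, the three positions stand in for three variables at one time step (an assumption), and the uniform scores are made up so the effect of the mask is easy to read.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over attention scores.
    With causal=True, position i may only attend to positions j <= i;
    with causal=False, every position attends to every other position."""
    n = len(scores)
    weights = []
    for i in range(n):
        allowed = [j for j in range(n) if (j <= i or not causal)]
        m = max(scores[i][j] for j in allowed)  # subtract the max for numerical stability
        exps = {j: math.exp(scores[i][j] - m) for j in allowed}
        z = sum(exps.values())
        weights.append([exps.get(j, 0.0) / z for j in range(n)])
    return weights

# Made-up uniform scores for three positions (e.g., three variables).
scores = [[0.0] * 3 for _ in range(3)]
causal = attention_weights(scores, causal=True)
full = attention_weights(scores, causal=False)
print("causal row 0:", causal[0])
print("full row 0:  ", full[0])
```

Under the causal mask the first position can only see itself, so the ordering of variables would matter; without the mask every variable can interact with every other one, which is the property the abstract relies on for variable-level alignment.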

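2606.08260 2026-06-09 cs.CV New submission

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

Qi Liu, Gang Yue, Mingyu Yin, Lisai Zhang, Yidi Wu, Yaole Wang, Yaohui Wang, Chang Yao, Jingyuan Chen, Lin Ma

Affiliations: Zhejiang University; Bilibili Inc.

AI summary: Proposes TIDE, a unified framework that uses per-token task embeddings and a dual-path conditioning scheme to support instruction-based editing, reference-guided editing, and multi-reference generation, reaching state-of-the-art performance with multi-task progressive training.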

Abstract

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks. Our project page is available at https://LittleWork123.github.io/tide.

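2606.08259 2026-06-09 cs.LG New submission

Differentially Private Synthetic Data via APIs 4: Tabular Data

Toan Tran, Arturs Backurs, Zinan Lin, Victor Reis, Li Xiong, Sergey Yekhanin

Affiliations: Microsoft

AI summary: Proposes Tab-PE, which extends the Private Evolution framework to tabular data and iteratively refines a candidate dataset with heuristic operators, handling high-order correlations efficiently while preserving differential privacy; compared with the AIM baseline, it improves classification accuracy by up to 10% and runs 28 times faster.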

Comments ICML'26

Abstract

This paper investigates the problem of generating synthetic tabular data with differential privacy (DP) guarantees, enabling data sharing in sensitive domains. Despite extensive study, state-of-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we extend the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE, an algorithm for synthetic tabular data generation under DP constraints. Tab-PE iteratively improves a candidate dataset via an evolutionary process that leverages tabular-specialized operators to produce variations, privately scores them, and selects the highest-quality samples to retain and propagate. In contrast to the original PE, which relies on large foundation models, Tab-PE employs heuristic operators with significantly lower computational costs, making PE more practical and scalable for tabular data. Through extensive experiments on real-world and simulation datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to AIM, the best baseline, Tab-PE improves classification accuracy by up to 10% while running 28 times faster.

2606.08256 2026-06-09 cs.AI cs.DL New submission

Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing

Wisdom Dogah

Affiliations: Faculty of Computing and Mathematical Sciences, University of Mines and Technology (UMaT), Tarkwa, Ghana; BlackMatrix AI Research, Accra, Ghana

AI summary: Proposes the Traxia framework, which addresses verifiability, attribution, and reproducibility in scientific publishing through agent identity, verifiable publishing, four-tier peer review, a reputation mechanism, and a knowledge graph.

Comments 22 pages, 3 figures, 3 tables. Preprint. Under active development. Comments welcome

Abstract

Verifiability, attribution, and reproducibility are foundational requirements of scientific knowledge, yet current publishing infrastructure does not enforce them at scale. We introduce Traxia, an agent-native scientific publishing framework in which AI research agents publish verifiable papers, build reputational identities, peer-review one another, and collaborate with humans in a shared provenance model. Traxia treats agents as first-class epistemic participants: every paper carries a reasoning trace, every claim a confidence interval, every agent a cryptographically signed identity, and every collaboration an immutable contribution log. We formalise five components: Agent Identity and Registry, Verifiable Publishing Layer, four-tier Peer Review Protocol, Reputation and Staking Engine, and a Knowledge Graph with contradiction detection. The framework targets reproducibility failure, provenance opacity, and exclusion of Global South research capacity. This paper presents architectural foundations and formal specifications only; it does not report empirical results. Evaluation and deeper component studies will follow in subsequent papers. A prototype partially implements core formalisms; the full system remains under active development.

2606.08254 2026-06-09 cs.CL 新提交

SSR: Can Simulated Patients Learn to Stigmatize Themselves? Modeling Self-Stigma through Internal Monologue

SSR: 模拟患者能否学会自我污名化?通过内心独白建模自我污名

Kunyao Lan, Bingrui Jin, Zichen Zhu, Mengyue Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) X-LANCE Lab, Dept. of Computer Science and Engineering(X-LANCE实验室,计算机科学与工程系) MoE Key Lab of Artificial Intelligence, AI Institute(教育部人工智能重点实验室,人工智能研究院)

AI总结 提出基于心理3A1H模型的SSR框架,通过内心独白数据集和链式思维微调LLM,使模拟患者根据对话触发动态调整污名表达,生成更真实的情境适应性反应。

详情
AI中文摘要

使用大语言模型(LLM)模拟患者是心理健康训练的一种有前景的工具,但现有方法未能捕捉一个关键的临床现实:自我污名。经历自我污名的患者,即内化负面刻板印象,通常表现出情境敏感性的抵抗,如回避、否认或自责,而当前模型将其呈现为静态或统一顺从的行为。为了解决这一问题,我们引入了一个基于自我污名化心理3A1H模型的新型模拟框架。我们的核心创新是创建了一个\textbf{污名化自我反思}(\textbf{SSR})数据集,在该数据集中,我们通过反映污名意识推理的内心独白来增强心理健康对话。通过使用链式思维方法对LLM进行微调,我们训练患者代理根据对话触发动态调整其污名水平和表达方式。评估表明,我们的方法显著优于专门的基线,生成了更真实且情境适当的患者反应。这项工作为临床训练和共情对话系统的现实污名模拟迈出了关键一步。

英文摘要

Simulating patients with large language models (LLMs) is a promising tool for mental health training, but existing approaches fail to capture a key clinical reality: self-stigma. Patients experiencing self-stigma, the internalization of negative stereotypes, often exhibit context-sensitive resistance, such as avoidance, denial, or self-blame, which current models render as static or uniformly compliant behavior. To address this, we introduce a novel simulation framework grounded in the psychological 3A1H model of self-stigmatization. Our core innovation is the creation of a \textbf{Stigmatized Self-Reflection} (\textbf{SSR}) dataset, where we augment mental health dialogues with internal monologues that reflect stigma-aware reasoning. By fine-tuning LLMs with this data using a chain-of-thought approach, we train patient agents to dynamically adjust their level and expression of stigma based on conversational triggers. Evaluations demonstrate that our approach significantly outperforms specialized baselines, generating more authentic and situationally appropriate patient responses. This work provides a crucial step towards realistic stigma simulation for clinical training and empathetic dialogue systems.

2606.08253 2026-06-09 cs.RO cs.LG 新提交

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

注意你的步伐:一种用于精确人形机器人落脚点跟踪的通用学习框架

Alessandro Montenegro, Shihao Li, Puze Liu, Alberto Maria Metelli, Jan Peters

发表机构 * Politecnico di Milano(米兰理工大学) TU Darmstadt(达姆施塔特工业大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Italian Institute of Technology(意大利技术研究院) University of Pisa(比萨大学)

AI总结 提出一种轻量级通用3D落脚点跟踪策略学习框架,通过目标采样器动态提供步态支持,结合新目标表示克服真实世界噪声,实现与多种高层规划器无缝集成的精确自然运动。

Comments Accepted to RSS 2026

详情
AI中文摘要

使人形机器人在复杂动态环境中运行仍然是一个关键挑战,其根本受限于稳健、安全且精确导航的能力。虽然基于速度指令策略的强化学习在人形机器人运动方面取得了显著的鲁棒性,但这种方法缺乏对落脚点位置的显式控制,导致不安全行为(如踩到人脚)或不精确导航,阻碍后续操作任务。相反,显式落脚点跟踪策略通过直接以目标足部姿态作为指令提供了一种有前景的替代方案。然而,现有方法通常受限于不切实际的状态假设(影响实际部署),或者作为分阶段流程的一部分而受限于特定下游任务。在这项工作中,我们引入了一种新颖的轻量级框架,用于训练通用的3D落脚点跟踪策略。通过目标采样器动态提供步态支持,该方法使学习到的策略对特定地形不敏感。我们的新目标表示有效缓解了现实世界中出现的挑战,例如噪声和不准确的姿态估计以及足部接触估计。为直接迁移到现实世界而设计,我们的策略作为一个独立的低级控制器,可以与各种高级落脚点生成器无缝配对。通过在仿真和现实世界中的大量实验,我们证明了框架的有效性。通过将我们的策略与不同的上游规划器耦合,我们在具有挑战性的环境中实现了自然且精确的运动,为复杂环境中的运动-操作任务铺平了道路。

英文摘要

Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.

2606.08249 2026-06-09 cs.RO cs.LG 新提交

Disturbance-Aware Aerial Robotics for Ethical Wildlife Monitoring

面向道德野生动物监测的扰动感知空中机器人

Mahmut Osmanovic, Isac Paulsson, Teddy Lazebnik

发表机构 * Department of Computing, Jonkoping University(延雪平大学计算机系) Department of Information Systems, University of Haifa(海法大学信息系统系)

AI总结 提出一种基于强化学习的扰动感知框架,用于异构空中机器人编队自主追踪野生动物,同时最小化行为干扰,在三种动物和四种行为模型上超越规则基线。

详情
AI中文摘要

可靠的野生动物监测对生态学和保护至关重要,然而许多现有方法,如标记、捕捉和近距离观察,可能会改变它们旨在测量的行为。空中机器人提供了一种可扩展的替代方案,在多项研究中显示出有前景的性能。尽管如此,现有方法通常缺乏行为感知,依赖固定启发式规则,或需要昂贵、不切实际且伦理上难以获取的真实世界训练数据。因此,目前尚无通用的自适应无人机监测框架,既能保持生态有效性,又能跨物种、行为和机器人平台扩展。在本研究中,我们引入了一种基于扰动感知强化学习的异构空中机器人编队框架,能够自主追踪野生动物,同时明确最小化行为干扰。我们将动物学模拟环境与基于真实轨迹统计拟合的动物运动模型相结合,并使用一种捕捉观测质量与扰动风险之间权衡的奖励公式来训练控制策略。在三种具有不同生态和运动模式的物种(鸽子、豺和距翅麦鸡)以及四种在自然界中常见的日益策略性的行为模型上,学习到的策略持续超越当前使用的基于规则的基线,并泛化到不同的监测任务、动物动态和无人机类型。这些结果确立了扰动感知学习作为非侵入式自主野生动物观测的可行基础,为生态学和保护中可扩展、道德负责且科学可靠的机器人监测开辟了道路。

英文摘要

Reliable wildlife monitoring is essential for ecology and conservation, yet many existing methods, such as tagging, capture, and close-range observation, can alter the very behaviors they aim to measure. Aerial robots offer a scalable alternative, which has shown promising performance in multiple studies. Nonetheless, existing approaches typically lack behavioral awareness, rely on fixed heuristics, or require real-world training data that are costly, impractical, and ethically difficult to obtain. As a result, there remains no general framework for adaptive drone-based monitoring that can both preserve ecological validity and scale across species, behaviors, and robotic platforms. In this study, we introduce a disturbance-aware reinforcement-learning-based framework for heterogeneous aerial robotic fleets that enables autonomous wildlife tracking while explicitly minimizing behavioral disruption. We couple a zoologically grounded simulation environment with fitted animal movement models derived from real trajectory statistics, and train control policies using a reward formulation that captures the trade-off between observation quality and disturbance risk. Across three species (pigeon, jackal, and spur-winged lapwing) with distinct ecologies and motion patterns and four increasingly strategic behavior models common in nature, the learned policies consistently surpassed currently used rule-based baselines and generalized across monitoring tasks, animal dynamics, and drone types. These results establish disturbance-aware learning as a viable foundation for non-invasive autonomous wildlife observation, opening a path towards scalable, ethically responsible, and scientifically reliable robotic monitoring in ecology and conservation.

2606.08245 2026-06-09 cs.CL 新提交

ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL

ZAS-SQL: 从失败中提炼规则用于零样本文本到SQL

Hongzhou Zheng, Yixin Gou, Wenjia Zhang

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(同济大学上海自主智能无人系统科学中心) College of Architecture and Urban Planning, Tongji University(同济大学建筑与城市规划学院) Behavioral and Spatial AI Lab, Peking University & Tongji University(北京大学与同济大学行为与空间人工智能实验室)

AI总结 提出ZAS-SQL零样本框架,通过Map-Reduce规则蒸馏从失败案例中提取核心生成规则,结合知识增强模式表示、规则驱动结构化推理和执行引导早停三个模块,在Spider上达到87.2%和88.6%执行准确率,超越多个少样本和微调方法。

详情
AI中文摘要

文本到SQL将自然语言转换为可执行的SQL查询。基于大语言模型(LLM)的少样本上下文学习方法表现出色,但其对示例的依赖限制了跨领域泛化,并消耗大量上下文窗口空间。现有的零样本方法缺乏有效的生成约束,仍落后于少样本方法。我们观察到LLM在零样本文本到SQL中的失败并非随机,而是表现出系统性的、重复出现的模式。基于这一观察,我们提出了一个完全零样本的文本到SQL框架,该框架通过基于Map-Reduce的规则蒸馏管道从失败案例中提炼核心生成规则,并通过三个互补模块提高生成质量:知识增强的模式表示,补充数据定义语言中缺失的语义;规则驱动的结构化推理框架,抑制结构偏差;以及执行引导的早停,实现低成本的自我纠正。在Spider上,所提出的框架在开发集和测试集上分别达到87.2%和88.6%的执行准确率,建立了新的零样本最先进水平,并超越了多个基于GPT-4/4o的少样本和微调方法。在领域特定数据集UrbanPlan上,它达到了81.3%,证实了规则蒸馏方法跨领域的泛化能力。此外,当配备4B参数模型时,该框架超越了领先闭源模型的零样本基线,展示了强大的模型通用性。

英文摘要

Text-to-SQL translates natural language into executable SQL queries. Few-shot in-context learning methods built upon large language models (LLMs) achieve strong performance, yet their reliance on demonstrations limits cross-domain generalization and consumes substantial context window space. Existing zero-shot methods, lacking effective generation constraints, still fall short of few-shot approaches. We observe that LLM failures in zero-shot Text-to-SQL are not random but exhibit systematic, recurring patterns. Building on this observation, we propose a fully zero-shot Text-to-SQL framework that distills core generation rules from failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality via three complementary modules: knowledge-augmented schema representation, which supplements missing semantics in Data Definition Language; a rule-driven structured reasoning framework that suppresses structural deviations; and Execution-Guided Early Stopping, which enables low-cost self-correction. On Spider, the proposed framework achieves up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, establishing a new zero-shot state-of-the-art and surpassing multiple few-shot and fine-tuning methods built upon GPT-4/4o. On the domain-specific dataset UrbanPlan, it achieves 81.3%, confirming that the rule distillation approach generalizes across domains. Moreover, when equipped with a 4B-parameter model, the framework surpasses zero-shot baselines of leading closed-source models, demonstrating strong model generality.
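
The abstract does not say how Execution-Guided Early Stopping decides when to stop. As a rough illustration of the general idea only, the sketch below runs candidate queries against a database in order and stops at the first one that executes without error. The helper `first_executable`, the toy `singer` table, and the candidate list are all hypothetical and are not taken from the paper.

```python
import sqlite3


def first_executable(candidates, conn):
    """Return (sql, rows) for the first candidate that runs without error.

    Assumption: "early stopping" is read here as "accept the first query the
    database can execute"; the paper's actual criterion may differ.
    """
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # execution failed, so try the next candidate
        return sql, rows  # stop early: later candidates are never executed
    return None, None


# Toy in-memory database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 25)])

candidates = [
    "SELECT nme FROM singer",                  # misspelled column, fails
    "SELECT name FROM singer WHERE age > 26",  # first query that executes
    "SELECT name FROM singer",                 # never reached
]
sql, rows = first_executable(candidates, conn)
```

A query that executes is not necessarily correct, so a check like this can only catch failures that surface as database errors.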

2606.08243 2026-06-09 cs.CL 新提交

Building Comparative Motivation Profiles with Instrumental Interventions

通过工具性干预构建比较动机概况

David Vella Zarb, Rustem Turtayev, Taywon Min, Jinghua Ou, Shi Feng

发表机构 * MATS University of Cambridge(剑桥大学) KAIST(韩国科学技术院) George Washington University(乔治华盛顿大学)

AI总结 通过对称工具性干预区分对齐伪装中的策略性自我保护与研究者期望追踪,发现模型对期望追踪更敏感,提示需要构念效度检验。

详情
AI中文摘要

安全性评估通常从行为模式推断潜在动机,但这些推断的构念效度尚不明确。我们在对齐伪装中研究这一问题,即当模型推断出训练压力时,它们更常服从训练目标。这种行为通常被解释为策略性自我保护,但也可能反映模型对研究者期望的敏感性。我们引入一个对称干预框架来区分这些竞争性假设。我们不直接干预“诡计”或“谄媚”,而是针对每个假设所蕴含的工具性过程:后果追踪和研究者期望追踪。然后比较对这些过程的干预如何影响对齐伪装。我们使用合成文档微调(SDF)、激活引导和提示研究了四个开放权重的模型有机体。在合成文档微调下,Llama-3.1-70B、Llama-3.1-405B 和 Qwen-2.5-72B 对期望追踪干预比后果追踪干预更敏感。对 Llama-3.1-70B 的激活引导支持相同的总体图景,提示干预与 SDF 概况大致一致。总体而言,尽管存在与诡计一致的草稿板,对齐伪装行为仍可能在因果上对评估情境中的期望敏感。因此,诡计和策略性欺骗评估需要构念效度检验,而对称工具性干预提供了这样一种检验。

英文摘要

Safety evaluations often infer latent motivations from behavioral patterns, but the construct validity of these inferences is unclear. We study this problem in alignment faking, where models comply with training objectives more often when they infer training pressure. This behavior is commonly interpreted as strategic self-preservation, but it may also reflect sensitivity to the model's inference about the expectation of researchers conducting the evaluation. We introduce a symmetric intervention framework for distinguishing these competing hypotheses. Instead of directly intervening on "scheming" or "sycophancy", we target instrumental processes entailed by each hypothesis: consequence-tracking and researcher-expectation tracking. We then compare how interventions on these processes affect alignment faking. We study four open-weight model organisms using synthetic document fine-tuning (SDF), activation steering, and prompting. Under synthetic document fine-tuning, Llama-3.1-70B, Llama-3.1-405B, and Qwen-2.5-72B are more sensitive to expectation-tracking than consequence-tracking interventions. Activation steering on Llama-3.1-70B supports the same broad picture, and prompt interventions broadly align with SDF profiles. Overall, alignment-faking behavior can be causally sensitive to evaluation-context expectations despite scheming-consistent scratchpads. Scheming and strategic-deception evaluations therefore need construct-validity checks, and symmetric instrumental interventions provide one such test.

2606.08242 2026-06-09 cs.CV 新提交

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Light-WAM:基于状态融合动作解码的高效世界动作模型

Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang

发表机构 * Wuhan University(武汉大学) Shanghai Innovation Institute(上海创新研究院) Southeast University(东南大学) Fudan University(复旦大学) East China Normal University(华东师范大学)

AI总结 提出轻量级世界动作模型Light-WAM,通过紧凑视频骨干和降维潜空间未来视频监督降低训练成本,并引入状态融合动作专家实现高效动作预测,在LIBERO和RoboTwin 2.0上取得良好性能。

详情
AI中文摘要

世界动作模型(WAM)通过将未来预测作为额外训练目标来扩展机器人策略学习,鼓励策略在其表示中编码任务相关的时间结构。当前的WAM通常依赖大规模生成架构,导致高训练成本和推理延迟,难以部署为高效的闭环策略。我们提出Light-WAM,一种轻量级的世界动作模型,用于高效的机器人操作。具体来说,它采用紧凑的视频骨干网络,并在降维的潜空间中进行未来视频监督,降低了视频协同训练的成本,同时保留了其对表示学习的益处。对于动作预测,Light-WAM引入了状态融合动作专家(StateFusionActionExpert),该专家从多个骨干层读取适应后的状态,通过可学习查询池化进行融合,并在单次前向传播中直接预测动作块。这种设计为视频骨干表示与机器人动作之间提供了高效接口,避免了繁重的生成式动作专家。实验表明,Light-WAM在LIBERO上保持强劲性能,在RoboTwin 2.0上实现了可用的多任务性能,同时仅使用0.44B可训练参数。它还实现了72.03ms的推理延迟,峰值GPU内存为4.1GiB,并提高了训练吞吐量。

英文摘要

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.

2606.08239 2026-06-09 cs.AI cs.CL cs.CV 新提交

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

当没有正确答案时:诊断视频理解中多模态大语言模型的缺失答案检测

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

发表机构 * Duke University(杜克大学)

AI总结 研究多模态大语言模型在视频理解中检测缺失答案的能力,发现模型倾向于选择干扰项而非识别无正确答案,时间推理任务中问题更严重,链式思维提示虽提升检测率但仍不理想。

Comments Under review

详情
AI中文摘要

多模态大语言模型在视频理解方面取得了实质性进展,但其响应的可靠性仍未得到充分探索。本文对视频理解中多模态大语言模型的缺失答案检测进行了诊断研究,其中正确答案被故意排除在候选集之外,而一个可靠的模型应能识别出没有有效选项。我们在三种设置下评估缺失答案检测行为:带有“以上皆非”选项的多选题、带有检测指令的开放式生成,以及没有任何指导的标准评估。在多种模型和基准测试中,我们发现多模态大语言模型压倒性地选择合理的干扰项,而不是检测到缺失答案。这种失败在时间推理任务中更为明显,并且随着帧采样密度的增加而恶化。我们进一步探索了链式思维提示作为缓解策略,发现虽然它显著提高了检测率,但性能仍不令人满意,这表明仅基于提示的策略不足以完全解决这一局限性。这些发现揭示了缺失答案检测中的系统性失败,并强调了在多模态系统中需要明确的检测机制。

英文摘要

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.
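
To make the first evaluation setting concrete, the sketch below builds an absent-answer item by dropping the correct option from a multiple-choice question and appending a "None of the Above" choice, then scores how often a model picks that choice. The function names, the item layout, and the example question are hypothetical; the benchmark's real data format is not described in the abstract.

```python
def make_absent_answer_item(question, options, correct_idx, nota="None of the Above"):
    """Remove the correct option and append a 'None of the Above' choice.

    Assumption: the NOTA choice is placed last; the study may order or
    word the options differently.
    """
    distractors = [opt for i, opt in enumerate(options) if i != correct_idx]
    choices = distractors + [nota]
    return {"question": question, "choices": choices, "target": len(choices) - 1}


def detection_rate(items, predictions):
    """Fraction of items where the predicted index is the NOTA choice."""
    hits = sum(1 for item, pred in zip(items, predictions) if pred == item["target"])
    return hits / len(items)


# Hypothetical video question, for illustration only.
items = [
    make_absent_answer_item(
        "What happens after the door opens?",
        ["A dog runs in", "A man leaves", "The light turns off"],
        correct_idx=1,
    )
]
rate_if_distractor = detection_rate(items, [0])  # model picks a distractor
rate_if_nota = detection_rate(items, [2])        # model picks NOTA
```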

2606.08238 2026-06-09 cs.LG 新提交

GPT-Micro: A large language paradigm for accelerated, inexpensive, and thermodynamics-consistent discovery of constitutive models in manufacturing

GPT-Micro: 一种用于制造业中加速、低成本且热力学一致的本构模型发现的大语言范式

Soumik Dutta, Kiarash Naghavi Khanghah, Sania Shree, Logan McNeil, Thomas Feldhausen, Hongyi Xu, Rajiv Malhotra

发表机构 * Department of Mechanical and Aerospace Engineering, Rutgers University(罗格斯大学机械与航空航天工程系) Department of Mechanical, Aerospace & Manufacturing Engineering, University of Connecticut(康涅狄格大学机械、航空航天与制造工程系) Edison Welding Institute(爱迪生焊接研究所) Manufacturing Science Division, Oak Ridge National Laboratory(橡树岭国家实验室制造科学部) Department of Aerospace and Mechanical Engineering, University of Texas at El Paso(德克萨斯大学埃尔帕索分校航空航天与机械工程系)

AI总结 提出GPT-Micro范式,结合大语言模型、热力学约束和稀疏数据,实现自主发现本构模型,在印刷电子测试中数据量减少70%、发现时间缩短400倍。

Comments 23 pages, 4 tables, 11 equations, 9 figures

详情
AI中文摘要

本构模型描述了工艺施加的材料状态与基本材料属性之间的关系,对于制造过程中材料微观结构的控制至关重要。传统上依赖易错的人类经验和直觉来假设和修正模型函数形式,导致模型发现过程缓慢且增量式改进,精度有限。传统的机器学习需要大量数据生成成本和时间。使用大语言模型的模型发现存在上述问题,并且/或者忽略了基本热力学定律的不可违背性。本文创建了一种新颖的GPT-Micro范式,用于自主、数据稀疏且符合热力学的全新本构模型发现。该框架无缝集成了文献语义知识提取、基于热力学的守恒定律强制执行、稀疏数据集以及大语言模型驱动的模型假设生成与改进。在印刷电子工艺测试平台上对一个长期难以解决的本构建模问题进行了验证。结果表明,与现有技术相比,该方法具有显著且多方面的优势,包括:(a) 相比基于机器学习的建模,数据负担减少超过70%,且精度不损失;(b) 相比人工驱动建模,数据生成后的发现时间从数月缩短至数小时,减少400倍;(c) 发现具有新颖函数形式的模型,无需主观选择初始假设;(d) 通过综合紧凑、符合守恒定律且物理完整的解析模型,增强了基于物理的可信度、人类可解释性和机理洞察。讨论了GPT-Micro在制造业中实现快速、低成本、物理可信且可解释的微观结构建模的潜力。

英文摘要

Constitutive modeling of the relationship between process-imposed material states and fundamental material properties is critical to control of material microstructure in manufacturing processes. The limited accuracy resulting from the typical reliance on fallible human expertise and intuition for postulation and revision of the model's functional form results in incremental and time-consuming model discovery. Conventional Machine Learning (ML) incurs significant cost and time of data generation. Model discovery using Large Language Models (LLMs) suffers from the above issues and/or ignores the inviolability of fundamental thermodynamics laws. This work creates a novel GPT-Micro paradigm for autonomous, data-sparse, and thermodynamics-compliant discovery of de novo constitutive models. This framework seamlessly integrates semantic knowledge extraction from literature, enforcement of thermodynamics-based conservation laws, and sparse datasets, with LLM-driven generation and refinement of model hypotheses. Validation is performed for a long-intractable constitutive modeling problem in a printed electronics process testbed. This reveals significant and simultaneous advantages over the state-of-the-art including: (a) More than 70 percent reduction in data burden relative to ML-based modeling without loss in accuracy; (b) 400X reduction in discovery time after data generation, from months to hours, relative to human-driven modeling; (c) Discovery of models with novel functional forms without subjective human choice of a starting hypothesis; (d) Enhanced physics-rooted trustworthiness, human interpretability, and mechanistic insight via synthesis of compact, conservation-compliant, and physically complete analytical models. The potential of GPT-Micro to realize rapid, low-cost, physically trustworthy, and interpretable microstructure modeling across the manufacturing landscape is discussed.

2606.08236 2026-06-09 cs.CL cs.LG 新提交

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

共享语义,不同机制:通过对齐语义与机制的无监督特征发现

Hyunjin Cho, Youngji Roh, Jaehyung Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种无监督方法,通过语义嵌入和归因签名聚类模型续写,发现隐藏的机制模式,补充电路分析。

Comments 40 pages

详情
Journal ref
ICML 2026 Spotlight
AI中文摘要

随着大型语言模型越来越多地部署在高风险场景中,人们越来越需要工具来审计不仅模型输出,还包括产生这些输出的内部计算。电路分析是机械可解释性中的核心方法,但通常是目标条件化的,解释单个提示与选定补全的配对。这种目标条件化设置可能掩盖模型续写分布中的异质性。我们引入了分布级无监督特征发现,该方法使用语义内容和序列级机械归因对采样续写进行聚类,而无需手动指定目标输出。我们的方法用语义嵌入和前缀到续写的归因签名表示每个续写,然后优化一个率失真目标,该目标在语义一致性、机械一致性和聚类粒度之间进行权衡。在聚类和引导分析中,发现的聚类暴露了单视图基线遗漏的续写模式,并提供了干预证据,表明聚类签名对应于可操作的机械因素。总的来说,我们的方法通过提供对模型续写分布背后机制的可扩展审计,补充了电路分析和行为评估。

英文摘要

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

2606.08234 2026-06-09 cs.AI 新提交

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

SciTrace: 面向科学发现代理的轨迹感知安全推理

Tanush Swaminathan, Runmin Jiang, Letian Zhang, Min Xu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Allen Institute(艾伦研究所)

AI总结 提出SciTrace框架,通过安全内在推理循环和组合工具链验证器,在科学代理管道的每个阶段融入安全推理,实现工具调用安全性和对抗鲁棒性的SOTA提升。

Comments 23 pages

详情
AI中文摘要

基于LLM的科学代理在自主研究方面展现出强大能力,但其安全层在结构上与核心推理相分离:它们检查管道输出,而非塑造产生输出的推理过程。这种分离导致两种故障模式:一个阶段积累的安全信号在下一阶段被丢弃,以及一系列单独良性的工具调用可能组合成有害结果,而单步过滤器无法检测到。为了解决这些挑战,我们引入了\textbf{SciTrace},这是一个将安全推理编织到科学代理管道每个阶段的框架。SciTrace结合了两种互补机制:\textit{安全内在推理循环}(SIR),通过联合任务与安全推理,在思考者、实验者、写作者和审阅者阶段维护累积风险状态;以及\textit{组合工具链验证器}(CTV),在执行前执行轨迹感知安全检查,捕捉仅出现在多步工具序列中的风险。在跨越六个科学领域的240个高风险研究任务和120个工具相关风险任务上的评估中,SciTrace在四个骨干模型上实现了框架间的\textbf{最先进}(SOTA)安全性:它持续提高了工具调用安全性和对抗鲁棒性,同时保持了科学输出质量,并发现了单步监视器遗漏的\textbf{78.8\%}的组合工具链逃逸。项目网站可在https://opensciagent.github.io/SciTrace/ 获取。

英文摘要

LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects. To address these challenges, we introduce \textbf{SciTrace}, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a \textit{Safety-Intrinsic Reasoning Loop} (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a \textit{Compositional Tool-Chain Verifier} (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences. Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (\textbf{SOTA}) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers \textbf{78.8\%} of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.
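
The contrast between a single-step filter and a trajectory-aware check can be sketched as follows. This is only a toy reading of the idea, not SciTrace's CTV: the tag vocabulary, the `FORBIDDEN_COMBOS` rule, and the tool names are invented for illustration, and the abstract does not describe how the real verifier represents risk.

```python
# Hypothetical rules: one tag blocked outright, one combination blocked jointly.
BLOCKED_SINGLE = {"weaponization"}
FORBIDDEN_COMBOS = [frozenset({"toxic_target", "synthesis_route", "procurement"})]


def single_step_ok(call):
    """Single-step filter: looks at one tool call in isolation."""
    return not (call["tags"] & BLOCKED_SINGLE)


def trajectory_ok(calls):
    """Trajectory-aware check: accumulates tags across the planned sequence
    and rejects the plan once any forbidden combination is fully covered."""
    seen = set()
    for call in calls:
        seen |= call["tags"]
        if any(combo <= seen for combo in FORBIDDEN_COMBOS):
            return False
    return True


# Hypothetical plan: each call looks benign on its own.
plan = [
    {"tool": "literature_search", "tags": {"toxic_target"}},
    {"tool": "retrosynthesis", "tags": {"synthesis_route"}},
    {"tool": "vendor_lookup", "tags": {"procurement"}},
]
each_step_passes = all(single_step_ok(c) for c in plan)
whole_plan_passes = trajectory_ok(plan)
```

Here every call passes the per-step filter while the plan as a whole is rejected, which is the kind of compositional escape the abstract says single-step monitors miss.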

2606.08231 2026-06-09 cs.CV 新提交

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

多模态基础模型中的测试时扩展:生成与推理的综合调查

Cong Wan, Ying He, Zhongzhan Huang, Hefeng Wu

发表机构 * Sun Yat-sen University(中山大学)

AI总结 本文首次系统综述多模态基础模型中的测试时扩展(TTS)方法,提出统一分类框架(采样、反馈、搜索三类),总结应用与基准,并讨论未来方向。

Comments Accepted by ACL 2026, Findings

详情
AI中文摘要

测试时扩展(TTS)已成为通过在推理过程中动态分配计算资源来增强模型性能的关键研究方向。最近的进展将这一范式应用于多模态基础模型(MFMs),释放了它们在多模态推理和生成方面的潜力。尽管进展迅速,该领域缺乏系统性的调查和统一的理论框架来描绘多模态TTS的发展格局。为填补这一空白,我们首次对MFMs的TTS研究进行了全面综述,提出了一个统一的分类框架,将现有方法归纳为三种不同策略:基于采样的、基于反馈的和基于搜索的方法。我们进一步总结了常用于评估多模态TTS在生成和推理任务中能力的代表性应用和基准。最后,本调查讨论了开放挑战并概述了未来研究方向,为这一快速发展的领域的后续研究提供了系统路线图。

英文摘要

Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.

2606.08221 2026-06-09 cs.LG 新提交

De novo molecular generation with optical property preconditioning at the token level

基于Token级光学性质预条件的从头分子生成

Haozhe Huang, Manuel Gonzalez Lastre, Hyun Suk Park, Jorge A. Campos-Gonzalez-Angulo, Xinjian Liu, Alán Aspuru-Guzik

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所) Universidad Autónoma de Madrid(马德里自治大学) Canadian Institute for Advanced Research (CIFAR)(加拿大高等研究院) NVIDIA(英伟达)

AI总结 针对OLED分子光学性质可控生成中数据稀缺和条件控制可靠性有限的问题,提出基于GPT2的Token条件自回归语言模型,通过离散属性Token和多任务优化实现垂直吸收能和振子强度的定向生成,并在TDDFT级别评估分布保真度和可控性。

详情
AI中文摘要

由于高质量数据的稀缺以及生成模型中跨化学基序的条件控制可靠性有限,设计具有目标光学性质的OLED分子仍然具有挑战性。在此,我们在现实低数据场景下对用于OLED分子生成的Token条件自回归语言模型进行了基准测试。一个GPT2模型在大规模化学语料库上进行预训练,增加了离散性质Token,并通过多任务优化进行微调。条件目标为垂直吸收能和振子强度,并将HOMO-LUMO能隙作为辅助电子描述符。生成的分子在TDDFT水平上进行评估,以评估分布保真度和可控性。生成的库再现了训练分布的主要光学性质支持,同时向更低分子量和更少重原子偏移。Token级控制在不同条件区间内一致定向,但并非完全正交,并表现出局部校准不规则性。化学型解析分析进一步表明,可控性强烈依赖于局部电子环境:适度共轭的芳香碳基序与改进的联合目标满足度相关,而吸电子基序,特别是芳基腈,表现出系统性红移和可控性降低。这些结果为条件OLED分子生成建立了定量基准,并表明模型可靠性必须在化学上有意义的子空间中评估,而非仅从聚合性质分布中评估。

英文摘要

Designing OLED molecules with targeted optical properties remains challenging due to the scarcity of high-quality data and the limited reliability of conditional control in generative models across chemical motifs. Here, we benchmark a token-conditioned autoregressive language model for OLED molecular generation in a realistic low-data regime. A GPT2 model is pretrained on large chemical corpora, augmented with discrete property tokens, and fine-tuned using multi-task optimisation. Conditioning targets vertical absorption energy and oscillator strength, with the HOMO-LUMO gap included as an auxiliary electronic descriptor. Generated molecules are evaluated at the TDDFT level to assess distributional fidelity and controllability. The generated library reproduces the dominant optical-property support of the training distribution while shifting towards lower molecular weight and fewer heavy atoms. Token-level control is consistently directional across conditioning bins, but is not fully orthogonal and exhibits local calibration irregularities. A chemotype-resolved analysis further shows that controllability depends strongly on local electronic environments: moderately conjugated aromatic-carbon motifs are associated with improved joint target satisfaction, whereas electron-withdrawing motifs, particularly aryl nitriles, show systematic red-shifting and reduced controllability. These results establish a quantitative benchmark for conditional OLED molecular generation and show that model reliability must be assessed in chemically meaningful subspaces rather than from aggregate property distributions alone.
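
One simple way to picture discrete property tokens is to bin a continuous target value and prepend the resulting token to the molecule string. The sketch below does exactly that; the token format, the bin range and count, and the example SMILES are assumptions made for illustration, since the abstract does not give the paper's binning scheme.

```python
def property_token(name, value, lo, hi, n_bins):
    """Map a continuous property value to a discrete token such as '<ABS_4>'.

    Assumption: equal-width bins over [lo, hi], with out-of-range values
    clamped to the first or last bin.
    """
    frac = (value - lo) / (hi - lo)
    idx = min(n_bins - 1, max(0, int(frac * n_bins)))
    return f"<{name}_{idx}>"


def conditioned_sequence(smiles, targets):
    """Prepend one property token per target to a SMILES string."""
    prefix = "".join(
        property_token(name, value, lo, hi, n_bins)
        for name, (value, lo, hi, n_bins) in targets.items()
    )
    return prefix + smiles


# Hypothetical targets: absorption energy (eV) and oscillator strength.
targets = {
    "ABS": (3.2, 1.0, 6.0, 10),
    "OSC": (0.75, 0.0, 2.0, 8),
}
sequence = conditioned_sequence("c1ccccc1C#N", targets)
```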

2606.08218 2026-06-09 cs.LG cs.AI math.ST stat.ML stat.TH 新提交

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

深度高斯过程到底有多深?组合高斯过程的尖锐阈值与非高斯极限

Mark Kozdoba, Shie Mannor

发表机构 * Technion, IIT(以色列理工学院) NVIDIA(英伟达)

AI总结 本文研究了深度高斯过程先验在深度增长时的极限行为,识别出RBF核带宽的尖锐阈值,低于该阈值时先验收敛到非退化非高斯分布,具有非零坐标依赖。

详情
AI中文摘要

组合先验描述了深度贝叶斯模型中分层函数的通用属性,其中随机权重的深度神经网络是一个典型例子。在宽网络极限下,先验是一个具有深度相关核的高斯过程,其随深度增长的行为已通过该核得到广泛研究。这里,我们研究另一种情况,其中每一层本身是一个向量值高斯过程,我们的目标同样是理解先验随深度增长的极限行为。先前的高斯过程工作已确定,对于RBF核和一定范围的带宽$r$,先验在极限下退化,收敛到常数函数集——这作为概率模型是无用的。在本文中,我们建立了几个新结果。首先,我们识别出一个尖锐的带宽阈值$r_c(d) = \Theta(\sqrt{d})$,高于该阈值极限是退化的,加强了先前的界限。其次,更重要的是,我们证明对于低于阈值$r_c(d)$的$r$,先验收敛到极限分布$\pi_{\bar{Z}}$。我们还证明这些分布是非退化且非高斯的,坐标之间具有非消失的依赖性。与先前已知的退化机制相反,深度高斯过程先验因此可以允许非平凡极限。实验上,我们在维度$d$的范围内验证了该阈值,并展示了极限分布$\pi_{\bar{Z}}$的复杂多模态行为——该机制随$d$增长而变得狭窄,且在不了解阈值的情况下难以识别。

英文摘要

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = \Theta(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $\pi_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $\pi_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.

2606.08214 2026-06-09 cs.RO 新提交

Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robotics with Digital Twins

面向人机协同工业机器人的智能神经符号规划与调试:基于数字孪生

Zhihao Liu, Victor Nan Fernandez-Ayala, Tianyu Wang, Qiang Qin, Xi Vincent Wang, Dimos V. Dimarogonas, Lihui Wang

Affiliation * Royal Institute of Technology (KTH)

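AI summary: Proposes a neuro-symbolic framework that combines LLM language understanding with deterministic verification and execution. It uses an SDI architecture and a two-tier recovery mechanism, verifies plans in a digital twin before physical execution, and achieves the highest task success against ten baselines.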

AI Chinese abstract

灵活的机器人自动化需要系统能够解释操作员意图、验证物理可行性,并在规划和执行阶段从执行失败中恢复。本文提出了一种面向人机协同工业机器人的智能神经符号框架,其中LLM用于需要语言理解或上下文推理的任务,而所有验证、排序和执行保持确定性。该框架将软件工程中的规划器-生成器-评估器(PGE)模式改编为面向工业机器人的指定器-设计器-检查器(SDI)架构,并结合基于LangGraph的动态路由进行故障恢复。两级恢复机制通过上下文感知编排处理结构级重新规划,并通过确定性恢复技能处理执行级几何故障。Unity3D数字孪生支持在物理执行前进行人工检查、修改和重新验证。在多个难度级别的自然语言命令上对十个基线进行评估,所提方法实现了最高的任务成功率。消融结果证实,结构化命令扩展、符号验证、选择性LLM路由和恢复技能各自都是必要的。

English abstract

Flexible robotic automation requires systems that interpret operator intent, verify physical feasibility, and recover from execution failures across both the planning and execution stages. This paper proposes an agentic neuro-symbolic framework for human-in-the-loop industrial robotics, in which LLMs are used for tasks that require language understanding or contextual reasoning, while all verification, sequencing, and execution remain deterministic. The framework adapts the Planner-Generator-Evaluator (PGE) harness pattern from software engineering into a Specifier-Designer-Inspector (SDI) architecture for industrial robotics, combined with LangGraph-based dynamic routing for failure recovery. A two-tier recovery mechanism addresses structure-level replanning through context-aware orchestration and execution-level geometric failures through deterministic recovery skills. A Unity3D digital twin supports human inspection, modification, and re-verification prior to physical execution. Evaluated on natural-language commands across multiple difficulty levels against ten baselines, the proposed method achieves the highest task success. Ablation results confirm that structured command expansion, symbolic verification, selective LLM routing, and recovery skills are each individually necessary.

2606.08212 2026-06-09 cs.LG New submission

Public Machine Learning Solver Framework for Novices in the Machine Learning Domain

面向机器学习初学者的公共机器学习求解器框架

Lokman Saleh, Hafedh Mili, Mounir Boukadoum

Affiliation * LATECE Lab, Université du Québec à Montréal

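AI summary: Proposes a semi-automated platform that combines expert knowledge with transfer learning to recommend complete machine learning pipelines to non-experts. It automatically extracts data characteristics and uses first-order logic reasoning to return ranked algorithm recommendations.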

AI Chinese abstract

解决机器学习问题很复杂,通常只有专家才能胜任。过去二十年中,出现了支持非专家的系统。根据我们的回顾,我们识别出三类:(1) 全自动AutoML系统,(2) 用于算法选择的专家备忘单,以及(3) 使用选择标准(准确性、透明度、数据要求)的决策支持系统。我们提出一个新平台,结合了第2和第3类,为非专家提供半自动化、智能的解决方案推荐。与推荐单一算法的现有方法不同,我们的平台建议一个针对用户问题量身定制的完整流水线。它整合了专家定义的选择标准与迁移学习,并自动从用户提供的数据集中提取数据特征(例如,类别不平衡、缺失值)。该平台使用一阶逻辑对其知识库进行推理,并推荐按相关性排序的合适算法。它具有用户友好的界面,并连接到面向机器学习专家的众包平台,确保持续更新。该平台是增量构建的,允许无缝集成新算法、标准和领域知识。据我们所知,这是第一个免费、公开可访问的在线框架,系统地捕获和操作专家知识,以结构化、透明的方式指导非专家解决机器学习问题。

English abstract

Solving machine learning problems is complex and typically reserved for experts. Over the past two decades, systems have emerged to support non-experts. Based on our review, we identify three categories: (1) fully automated AutoML systems, (2) expert cheat sheets for algorithm selection, and (3) decision-support systems using selection criteria (accuracy, transparency, data requirements). We propose a new platform combining categories 2 and 3 to deliver semi-automated, intelligent solution recommendations for non-experts. Unlike existing approaches that recommend a single algorithm, our platform suggests a complete pipeline tailored to the user's problem. It integrates expert-defined selection criteria with transfer learning and automatically extracts data characteristics (e.g., class imbalance, missing values) from user-provided datasets. The platform uses first-order logic to reason over its knowledge base and recommends suitable algorithms ranked by relevance. It features a user-friendly interface and connects to a crowdsourcing platform for ML experts, ensuring continuous updates. The platform is built incrementally, allowing seamless integration of new algorithms, criteria, and domain knowledge. To our knowledge, this is the first free, publicly accessible online framework that systematically captures and operationalizes expert knowledge to guide non-experts in solving ML problems in a structured, transparent manner.