arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.11396 2026-06-11 cs.RO New submission

PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation

Abhinav Kumar, Soshi Iba, Rana Soltani Zarrin, Dmitry Berenson

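Affiliations University of Michigan; Honda Research Institute USA

AI summary Proposes PLUME, a world model that jointly learns how a belief over parameters evolves and the dynamics conditioned on those parameters; online parameter inference enables zero-shot transfer, and the method outperforms existing approaches on tasks such as screwdriver turning.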

Comments 16 pages, 5 figures

Abstract

Dexterous manipulation with multi-finger hands can be sensitive to physical parameters such as object shape, pose, and friction coefficients. While simulation enables large-scale data collection with known parameter values, simulation-trained policies must still handle uncertainty at deployment, where the true parameters and therefore the true dynamics are unknown. Standard domain randomization strategies may be insufficient for precise tasks like screwdriver turning, as manipulation strategies may need to change depending on specific parameter values. To address this, we propose Probabilistic Latent Unified world Modeling and parameter Estimation (PLUME), a world model that jointly learns to evolve a belief over parameter values as well as the system dynamics conditioned on those parameters. We learn a latent space to jointly represent multiple qualitatively different physical parameters along with rewards, themselves functions of partially-observable variables, to inform planning. Our novel learning framework leads to efficient alignment of the world model to true dynamics through online parameter inference as opposed to re-training or fine-tuning. We evaluate our method on simulated screwdriver turning, valve turning, bucket lifting, and disk flicking tasks, as well as a hardware screwdriver turning task, where we achieve successful zero-shot transfer of our simulation-trained policy and outperform state-of-the-art offline reinforcement learning and world-model-augmented behavior cloning baselines. Please see our website at https://plume-world-model.github.io for videos.

2606.11391 2026-06-11 cs.LG New submission

Recursive Binding on a Budget: Subspace Carving in Order-p Tensor Memories

Travis Pence, Daisuke Yamada, Vikas Singh

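Affiliations University of Wisconsin-Madison

AI summary Proposes Orthogonal Subspace Carving (OSC), which binds fillers to roles by projecting them onto the null space of the role basis; a fixed-order tensor memory supports deep recursive binding and improves efficiency under high superposition at constant memory.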

Comments 24 pages, 12 figures, 7 tables

Journal ref
43rd International Conference on Machine Learning 2026

Abstract

Tensor Product Representations provide the structural fidelity required for symbolic reasoning in models but suffer from exponential dimensionality growth when encoding deep recursive structures. Conversely, Vector Symbolic Architectures maintain constant dimensionality but sacrifice capacity and fidelity due to noisy compression via superposition. In this work, we propose Orthogonal Subspace Carving (OSC), a memory architecture that binds fillers to roles by projecting onto the null space of the role basis before aggregating into a fixed order-p tensor. OSC uses projections to enforce geometric orthogonality between bound structures within a static memory trace. We show that this mechanism decouples the tensor order from the structural depth, enabling deep recursive binding within a constant memory footprint. By performing retrieval via recognition, this construction allows for component vectors that are orders of magnitude smaller than the memory tensor, giving superior memory efficiency in settings involving high superposition. We also show that TPR is a special case of binding in Clifford algebra, and give a Clifford formulation of OSC.
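The projection step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' OSC implementation: the abstract does not specify the construction, so the code below only shows the standard linear-algebra operation of removing from a filler vector its components along a set of role vectors. The function names, the toy vectors, and the assumption that the roles are orthonormal are all illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_to_null_space(filler, roles):
    """Remove from `filler` its components along each role vector.

    Assumes the role vectors are orthonormal (an assumption of this
    sketch); the result is then orthogonal to every role.
    """
    out = list(filler)
    for role in roles:
        coeff = dot(filler, role)
        out = [o - coeff * r for o, r in zip(out, role)]
    return out


# Hypothetical toy data: two orthonormal roles in a 4-dimensional space.
roles = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
filler = [0.5, -2.0, 3.0, 1.5]
projected = project_to_null_space(filler, roles)

# For contrast, a classical order-p tensor product memory over
# d-dimensional vectors needs d ** p entries, which grows with depth p.
d, depth = 4, 3
tpr_entries = d ** depth
```

The projected vector keeps only the coordinates outside the span of the roles, which is the sense in which a projection "carves" a subspace.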

2606.11390 2026-06-11 cs.CV cs.DC cs.GR cs.LG New submission

A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting

Matthew Cong, Francis Williams, Jonathan Swartz, Mark Harris, Sanja Fidler, Ken Museth

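Affiliations NVIDIA; University of Toronto; Vector Institute

AI summary Proposes a multi-GPU Gaussian splatting approach that distributes parameters at the operator level via CUDA unified memory and NVLink, enabling large-scale scene reconstruction with more than 1 billion Gaussian splats.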

Comments 14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material

Abstract

Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.

2606.11387 2026-06-11 cs.CL cs.AI cs.LG New submission

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

Felipe Chavarro Polania

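Affiliations Hewlett Packard Enterprise

AI summary Studies a staged-promotion protocol for micro-pretraining that screens configurations under fixed budgets, validated on Windows A100 and Linux L40S hosts; early rankings are unstable, but the executed 12-hour branch confirms the top-ranked configuration for 144 GPU-hours, less than continuing every candidate.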

Comments 14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts

Abstract

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
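The GPU-hour figures in the abstract are consistent with a simple accounting model, sketched below. The model is an assumption rather than something the abstract states: each continued candidate is taken to run for 12 hours on one GPU in each of four host-seed cells (two hosts times two seeds), and the executed branch is taken to contain three candidates (the bridge reference, the greedy comparator, and the d8/ar48 sentinel). The names are illustrative.

```python
HOURS_PER_RUN = 12          # length of the final confirmation stage
CELLS_PER_CANDIDATE = 4     # assumed: 2 hosts x 2 seeds, one GPU per cell


def continuation_cost(num_candidates):
    """GPU-hours needed to continue `num_candidates` through the 12-hour stage."""
    return num_candidates * CELLS_PER_CANDIDATE * HOURS_PER_RUN


executed_branch = continuation_cost(3)   # three promoted candidates (assumed)
all_60_minute = continuation_cost(4)     # counterfactual: all four 60-minute candidates
all_10_minute = continuation_cost(9)     # counterfactual: all nine 10-minute candidates

# Screening overhead implied by the reported 169.2 total training GPU-hours.
screening_hours = round(169.2 - executed_branch, 1)
```

Under these assumptions the three costs come out to 144, 192, and 432 GPU-hours, matching the reported numbers, and the screening stages account for the remaining 25.2 GPU-hours.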

2606.11386 2026-06-11 cs.CL cs.AI eess.AS New submission

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass

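Affiliations MIT CSAIL

AI summary Addresses the delayed internal transition of full-duplex spoken language models when users interrupt, proposing activation steering with a perception vector that substantially improves interruption comprehension without fine-tuning.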

Abstract

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
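Activation steering in general can be sketched as adding a fixed vector to a hidden state at inference time. The abstract does not say how the perception vector is obtained, so the difference-of-means construction below is an assumption (it is one common way to derive steering vectors), and the function names, the scale `alpha`, and the toy activations are all hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(perceptive_acts, generative_acts):
    """Difference of mean activations between the two states (assumed recipe)."""
    mean_p = mean_vector(perceptive_acts)
    mean_g = mean_vector(generative_acts)
    return [p - g for p, g in zip(mean_p, mean_g)]


def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering vector, with no weight updates."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Hypothetical 2-dimensional activations collected in each state.
perceptive = [[1.0, 0.0], [3.0, 2.0]]
generative = [[0.0, 1.0], [2.0, 1.0]]
vector = steering_vector(perceptive, generative)
steered = steer([0.5, 0.5], vector, alpha=2.0)
```

Because the intervention is a single vector addition per steered layer, it is consistent with the abstract's description of a training-free method with little additional overhead.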

2606.11385 2026-06-11 cs.CV New submission

DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

Jiayu Zhang, Shuo Ye, Jiajian Huang, Yawen Cui, Taorui Wang, Wei Xia, Zeheng Wang, Haowen Tang, Hui Ma, Zitong Yu

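Affiliations Great Bay University; Hong Kong Polytechnic University

AI summary Proposes the DeceptionX framework, which recasts deception detection from black-box classification as an interpretable Observe-Think-Summarize reasoning process; with the DeceptChain dataset and a three-stage training pipeline, it outperforms existing methods on standard benchmarks while providing expert-level, interpretable reasoning paths.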

Abstract

Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination (DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.

2606.11382 2026-06-11 cs.LG q-bio.BM New submission

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

Emily Nguyen, Yongchan Hong, Harsh Toshniwal, Yan Liu, Andreas Luttens

Affiliations Department of Computer Science, University of Southern California; Department of Quantitative and Computational Biology, University of Southern California; Amazon; Department of Medical Biochemistry and Biophysics, Science for Life Laboratory, Karolinska Institutet

AI summary Proposes GLACIER, a student-teacher framework that fuses three modalities (molecular graphs, SMILES strings, and physicochemical descriptors) and distills from large teacher models for efficient, accurate molecular property prediction.

Abstract

Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors; (2) we fuse these student modalities using a novel Finsler geometry-aware module; and (3) we distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at https://github.com/eemokey/glacier.

2606.11379 2026-06-11 cs.AI New submission

Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Jamie Bergen, Sarit Kraus

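Affiliations University of Washington; University of Haifa

AI summary Proposes a structured LLM pipeline as an automated mediator that supports pre-mediation in integrative negotiation by decomposing preparation into specialized modules; it is comparable to human mediators on short-term self-reported outcomes and achieves 36% lower RMSE on the preference-inference task.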

Comments 12 pages, 7 figures

Abstract

Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated mediator for human negotiation, implemented as a structured pipeline of LLM modules, that supports pre-mediation in integrative negotiation settings. The pipeline decomposes preparation into specialized modules for dialogue, preference prediction, response-level critique, and structured summarization, separating inference, generation, and evaluation to address limitations of monolithic single-prompt approaches. We use the term "agent" for each module following common LLM-systems terminology, but the components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence. We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario. On short-term self-reported measures, the automated mediator achieves preparation outcomes broadly comparable to human mediators, including trust in the mediator and confidence in reaching mutually beneficial agreements, while achieving substantially lower error on the preference-inference task under our scenario and prompts (36% lower RMSE). A second study shows that targeted prompt refinements reduce excessive affirmation patterns from 36.6% to 16.8%, matching human mediator baselines. Our findings suggest that structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes. The pipeline's single-party design mirrors how human mediators run pre-mediation today and enables parallel deployment across all parties to a dispute, supporting scalability.

2606.11375 2026-06-11 cs.CL cs.AI cs.LG New submission

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

Orion Reblitz-Richardson

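Affiliations Distiller Labs

AI summary Addresses the rapid saturation of linear-probe accuracy during pre-training by proposing fragility, a metric that measures probe robustness through the activation-noise level and reveals representational structure that accuracy cannot capture.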

Comments 22 pages, 5 figures. Code and datasets at https://github.com/deepsteer/deepsteer

Abstract

Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical → compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
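The definition of fragility as "the activation-noise level at which probe accuracy collapses" can be made concrete with a toy sketch. This is not the paper's protocol: the one-dimensional data, the fixed sign probe, the noise grid, and the 0.9 collapse threshold are all assumptions chosen for illustration. The sketch shows why the metric is sensitive to margin: two datasets with identical clean accuracy collapse at different noise levels.

```python
import random


def probe_accuracy(xs, ys, sigma, rng):
    """Accuracy of the fixed probe sign(x) after adding Gaussian noise."""
    correct = 0
    for x, y in zip(xs, ys):
        noisy = x + rng.gauss(0.0, sigma)
        pred = 1 if noisy > 0 else -1
        correct += pred == y
    return correct / len(xs)


def collapse_noise_level(xs, ys, sigmas, threshold=0.9, seed=0):
    """Smallest noise level in `sigmas` at which accuracy drops below `threshold`."""
    rng = random.Random(seed)
    for sigma in sorted(sigmas):
        if probe_accuracy(xs, ys, sigma, rng) < threshold:
            return sigma
    return None  # accuracy never collapsed on this grid


def toy_dataset(margin, n_per_class=1000):
    """Two classes placed at +margin and -margin on a line (hypothetical data)."""
    xs = [margin] * n_per_class + [-margin] * n_per_class
    ys = [1] * n_per_class + [-1] * n_per_class
    return xs, ys


sigmas = [0.01, 0.1, 1.0, 10.0, 100.0]
narrow = collapse_noise_level(*toy_dataset(0.1), sigmas)
wide = collapse_noise_level(*toy_dataset(10.0), sigmas)
```

Both toy datasets are perfectly separable without noise, so accuracy alone cannot tell them apart, while the collapse level differs by two orders of magnitude.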

2606.11372 2026-06-11 cs.RO New submission

HiPi: Reproducible High-Fidelity Piezoresistive Sensors for Robotic Manipulation

Changyi Lin, Raihan Haque, Hui-Ping Wang, Ding Zhao

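Affiliations Carnegie Mellon University; General Motors

AI summary Proposes HiPi, which combines a low-crosstalk readout principle with a reproducible hardware design to achieve 220 Hz readout in a bimanual setup with four arrays (2048 taxels) and raises contact-geometry IoU from 0.428 to 0.797.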

Abstract

Piezoresistive tactile sensors are attractive for robotic manipulation because they are thin, lightweight, low-cost, and scalable to dense large-area sensing. However, existing systems still face a practical trade-off: recent reproducible designs emphasize accessibility and ease of reproduction, whereas high-fidelity readout architectures remain more difficult to fabricate, assemble, and deploy. We present HiPi, a reproducible high-fidelity piezoresistive sensing system for robotic manipulation. Building on a low-crosstalk readout principle, HiPi redesigns the complete hardware stack around reproducibility, deployability, and multi-sensor scalability. The system includes a compact readout PCB compatible with commercial PCB fabrication and assembly services, eliminating manual soldering; a smaller and lower-cost STM32-based MCU module; an optimized communication pipeline that achieves 220 Hz readout in a bimanual setup with four dense tactile arrays (2048 taxels in total); and FPCB-based conductive layers that simplify sensor fabrication and stacking. Experiments with structured 3D-printed contact patterns show that HiPi preserves contact geometry substantially better than a reproducible baseline, improving the average IoU from 0.428 to 0.797 and the average Dice score from 0.539 to 0.886. These results suggest that HiPi bridges an important gap between reproducible fabrication and high-fidelity readout, making dense piezoresistive tactile sensing more practical for bimanual manipulation and multi-fingered robotic systems.
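The two overlap metrics reported in the abstract have standard definitions for binary masks, sketched below. The thresholding that turns raw taxel readings into a binary contact mask is not described in the abstract, so the masks here are hypothetical inputs and the function name is illustrative.

```python
def iou_and_dice(pred, target):
    """Intersection-over-union and Dice score for two binary masks.

    Masks are nested lists of 0/1 values with the same shape.
    Two empty masks are treated as a perfect match.
    """
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    intersection = sum(a and b for a, b in zip(p, t))
    union = sum(a or b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return iou, dice


# Hypothetical 2x3 contact masks: sensed contact versus the printed pattern.
sensed = [[1, 1, 0],
          [0, 1, 0]]
pattern = [[1, 0, 0],
           [0, 1, 1]]
iou, dice = iou_and_dice(sensed, pattern)
```

For a single pair of masks the two scores are tied by Dice = 2·IoU / (1 + IoU); averages over many samples, like those reported in the abstract, need not satisfy that identity.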

2606.11371 2026-06-11 cs.CL cs.AI eess.AS eess.SP New submission

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff

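Affiliations University of California, Berkeley

AI summary Proposes a semantic-timescale analysis pipeline that uses autocorrelation-window measures (ACW-0) to quantify the temporal organization of semantic specificity and contextual similarity in human and AI-generated speech; ACW-0 length is associated with vocabulary genericity, and the association weakens after shuffling.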

Comments 45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language

Journal ref
Computer Speech & Language (2026) 102013

Abstract

Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings, and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
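ACW-0 is commonly defined as the first lag at which a series' autocorrelation function reaches zero. The sketch below computes it for a plain list of numbers. It is not the paper's pipeline: lags are counted in samples rather than seconds, the autocorrelation estimator is the simple biased one, and the function names and toy series are illustrative.

```python
def autocorrelation(series, lag):
    """Biased sample autocorrelation of `series` at a non-negative integer lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom


def acw0(series):
    """First lag (in samples) at which the autocorrelation is zero or negative."""
    for lag in range(1, len(series)):
        if autocorrelation(series, lag) <= 0:
            return lag
    return None  # the autocorrelation never reached zero


# Hypothetical semantic time-series: a slow trend and a rapid alternation.
slow = [1, 2, 3, 4, 5, 6, 7, 8]
fast = [1, -1, 1, -1, 1, -1, 1, -1]
slow_acw = acw0(slow)
fast_acw = acw0(fast)
```

The slowly varying series keeps a positive autocorrelation for more lags than the alternating one, so it has the longer ACW-0; in the abstract's terms, longer windows correspond to content that changes more slowly over time.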

2606.11363 2026-06-11 cs.CV New submission

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

Hao Lu, Yongxin Guo, Onur Koyun, Zhengjie Zhu, Abbas Alili, Metin N. Gurcan

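Affiliations Wake Forest University School of Medicine; Advocate Health

AI summary Proposes NSVQ, a training strategy that mitigates codebook collapse in large-codebook VQ through a non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing, improving reconstruction quality on ImageNet-1k while maintaining 100% codebook utilization.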

Abstract

Vector quantization is central to modern generative modeling pipelines, but large-codebook VQ models often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder moves the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. Experiments on ImageNet-1k show that NSVQ improves reconstruction quality while maintaining full codebook utilization. On ImageNet-1k at 128×128 with 65,536 codes, NSVQ reduces rFID from 2.39 to 2.10 compared with SimVQ, while both methods maintain 100% utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
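The link between encoder drift and lost code assignments can be illustrated with a toy nearest-code quantizer. This sketch covers only the generic VQ lookup and a utilization measure, not NSVQ's losses or training schedule; the two-dimensional codebook, the latents, and the size of the drift are hypothetical.

```python
def nearest_code(z, codebook):
    """Index of the code vector closest to latent `z` in squared Euclidean distance."""
    def sq_dist(k):
        return sum((a - b) ** 2 for a, b in zip(z, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)


def utilization(latents, codebook):
    """Fraction of codes that receive at least one assignment."""
    used = {nearest_code(z, codebook) for z in latents}
    return len(used) / len(codebook)


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

# Before drift: each latent sits near a different code.
latents_before = [[0.1, 0.0], [0.9, 1.1], [4.8, 5.1]]
# After drift: the encoder has shifted every latent by (+3, +3),
# while the codebook has not moved.
latents_after = [[z[0] + 3.0, z[1] + 3.0] for z in latents_before]

util_before = utilization(latents_before, codebook)
util_after = utilization(latents_after, codebook)
```

With a stale codebook, all shifted latents fall to a single code and the other two go unused, which is the kind of assignment loss the abstract attributes to encoder drift.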

2606.11350 2026-06-11 cs.CL cs.IR New submission

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar

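Affiliations Dept. of Electrical Engineering & Computer Science, University of Wyoming; Dept. of Civil, Architectural Engineering & Construction Management, University of Wyoming

AI summary Addresses the degradation of retrieval-augmented generation on heterogeneous document collections caused by vector search dilution by proposing MASDR-RAG, a domain-scoping method based on organizational metadata that raises P@10 from 0.77 to 0.86, and identifies a precision-faithfulness paradox in multi-agent orchestration.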

Comments 24 pages, 8 figures, 30 tables. Preprint under review

Abstract

Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 (p < 0.05). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.
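The "scope first" recommendation can be sketched as a metadata filter applied before similarity ranking, together with the precision-at-k metric. This is a minimal illustration, not MASDR-RAG itself: the chunk records, the `domain` field, the two-dimensional embeddings, and the function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_vec, chunks, k, domain=None):
    """Top-k chunk ids by cosine similarity, optionally restricted to one domain."""
    pool = [c for c in chunks if domain is None or c["domain"] == domain]
    ranked = sorted(pool, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for i in retrieved_ids[:k] if i in relevant_ids) / k


# Hypothetical corpus: "hr-1" is close to the query in embedding space
# but belongs to the wrong organizational domain.
chunks = [
    {"id": "bridge-1", "domain": "bridges", "vec": [1.0, 0.0]},
    {"id": "bridge-2", "domain": "bridges", "vec": [0.9, 0.1]},
    {"id": "hr-1", "domain": "hr", "vec": [0.95, 0.05]},
    {"id": "hr-2", "domain": "hr", "vec": [0.0, 1.0]},
]
query = [1.0, 0.0]
relevant = {"bridge-1", "bridge-2"}

unscoped = retrieve(query, chunks, k=2)
scoped = retrieve(query, chunks, k=2, domain="bridges")
p_unscoped = precision_at_k(unscoped, relevant, k=2)
p_scoped = precision_at_k(scoped, relevant, k=2)
```

In this toy corpus the unscoped search lets a similar but off-domain chunk displace a relevant one, while the scoped search cannot return it at all, which is the dilution effect in miniature.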

2606.11341 2026-06-11 cs.LG cs.RO New submission

Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints

David Young, Swan Yi Htet

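Affiliations ORION Robotics

AI summary Proposes enforcing energy conservation between modules (the squared L2 norm of feature vectors is preserved) as a hard constraint; experiments show that it substantially outperforms baselines under several noise types, with depth invariance and a theoretical guarantee.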

Comments 22 pages, 2 figures, 7 tables, 25 references

English abstract

Modular neural network pipelines suffer from error compounding: noise at any module boundary propagates and potentially amplifies through subsequent modules. We introduce energy conservation as a hard physical constraint on inter-module information flow. Activation energy (the squared L2 norm of feature vectors) is enforced to be exactly preserved at every module boundary. Unlike soft energy penalties, conservation is an inviolable law: the network may redistribute energy across neurons but cannot create or destroy it. Four experiments on CIFAR-10 demonstrate: (1) conservation retains 77.4% of clean accuracy at noise sigma=0.2, versus 35.1% for baselines and 30.9% for energy-penalized models (p<0.001, 5 seeds); (2) pipelines become depth-invariant, retaining 93.3% at depths 2 through 5 with noise at every boundary; (3) the advantage generalizes to systematic bias (+45.1%), Gaussian (+40.4%), and adversarial noise (+4.8%), with a principled non-effect on dropout (-0.3%); (4) on ResNet-18, the conservation advantage scales inversely with intrinsic normalization: +0.3 pp with BatchNorm, +26.2 pp without at sigma=0.2, reaching +58.0 pp at sigma=0.5. Experiment 5 validates the operator on a real modular robotic pipeline (MuJoCo physics, Franka Panda). Across three independent runs on separate machines (90 trials per cell), conservation provides +18.9 pp average advantage on monocular-depth-style noise. A formal bound proves conserved noise energy is strictly less than input noise energy.
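The hard constraint described above can be illustrated with a small sketch. The abstract does not say how conservation is enforced; the rescaling operator below, and the name `conserve_energy`, are assumptions chosen as the simplest way to preserve the squared L2 norm exactly across a module boundary.

```python
import numpy as np

def conserve_energy(x_in, x_out, eps=1e-12):
    """Rescale x_out so each row has the same squared L2 norm as the matching row of x_in.

    The direction of x_out is kept, so energy is redistributed across
    features but neither created nor destroyed. (Assumed operator.)
    """
    e_in = np.sum(x_in ** 2, axis=-1, keepdims=True)
    e_out = np.sum(x_out ** 2, axis=-1, keepdims=True)
    return x_out * np.sqrt(e_in / (e_out + eps))
```

Unlike a soft penalty added to the loss, this projection holds at every forward pass, including when noise is injected at the boundary.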

2606.11337 2026-06-11 cs.AI cs.CL cs.CY New submission

Can AI Agents Synthesize Scientific Conclusions?

Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

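Affiliations * Princeton University; Universidade Federal de Minas Gerais; Stony Brook University; Hackensack Meridian School of Medicine

AI summary Introduces the SciConBench benchmark and the SciConHarness evaluation harness, which decompose conclusions into atomic facts and score precision and recall; the best frontier AI agent reaches a factual F1 of only 0.337 on scientific conclusion synthesis, unconstrained evaluation suffers from data leakage, and consumer-facing agents often produce incomplete or contradictory conclusions.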

Comments 79 pages, 34 figures, 17 tables. Under Submission

English abstract

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
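Factual precision, recall, and F1 over atomic facts can be sketched as below. This is only an illustration: the benchmark uses an expert-validated pipeline to decompose and match facts, whereas this example assumes the facts are already extracted and matches them by exact string equality. The function name `factual_prf` is hypothetical.

```python
def factual_prf(predicted_facts, reference_facts):
    """Return (precision, recall, F1) over sets of atomic facts.

    Matching is exact string equality here, a stand-in for the
    benchmark's own fact-matching step.
    """
    predicted, reference = set(predicted_facts), set(reference_facts)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Precision penalizes unsupported facts in the agent's conclusion, while recall penalizes facts from the expert conclusion that the agent missed.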

2606.11326 2026-06-11 cs.CV New submission

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

Minseong Kweon, Wenyuan Zhao, Nuo Chen, Lulin Liu, Huiwen Han, Zihao Zhu, Srinivas Shakkottai, Chao Tian, Zhiwen Fan

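Affiliations * University of Minnesota; Texas A&M University; Stanford University

AI summary Proposes DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes; it introduces thermal factorization and geometry-shared routing modules and maintains accuracy under degraded RGB conditions.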

Comments Project Page: https://darkvggt.github.io

English abstract

Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.

2606.11324 2026-06-11 cs.RO cs.AI cs.LG New submission

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Yifu Yuan, Yaoting Huang, Xianze Yao, Yutong Li, Shuoheng Zhang, Linqi Han, Pengyi Li, Jiangeng Sun, Wenting Jia, Zhao Zhang, Yuhao Liu, Ruihao Liao, Yucheng Hu, Qiyu Wu, Yuxiao Li, Zibin Dong, Fei Ni, Yan Zheng, Shuyang Gu, Yi Ma, Hongyao Tang, Han Hu, Jianye Hao

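Affiliations * Tianjin University; Tencent Hunyuan

AI summary Proposes Embodied-R1.5, a unified embodied foundation model built with automated data pipelines and multi-task balanced reinforcement learning; with 8B parameters it achieves the best result on 16 of 24 benchmarks and can be fine-tuned into a VLA model.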

Comments Embodied R1.5 technical report. Project page: https://embodied-r.github.io/

English abstract

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.

2606.11320 2026-06-11 cs.CV New submission

Semantic Segmentation of Node and Edge Diagrams for Assistive Technology

Michael Cormier, Yichun Zhao, Laura Paul, Cameron Swift, Duc Tri Dang, Miguel Nacenta

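Affiliations * Natural Sciences and Engineering Research Council of Canada

AI summary Proposes compact deep learning models for semantic segmentation of node-and-edge diagrams, reaching over 93% per-pixel accuracy on a synthetic dataset, to support non-visual access.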

Comments 8 pages, 6 figures, 1 table. In Proceedings of the 23rd Conference on Robots and Vision (2026)

English abstract

In this paper, we present a novel set of related models for semantic segmentation of node-link diagrams. These diagrams are frequently used to represent mathematical graphs, relationships between concepts, and flowcharts. Such diagrams are difficult to access non-visually; while some assistive interfaces have been designed for node-link diagrams, they rely upon a machine-readable representation of the diagram, whereas such diagrams will generally be made available as bitmap images. Our compact deep learning models show excellent quantitative and qualitative performance on a large synthetic dataset of node-link diagrams, reaching per-pixel accuracy over 93%.

2606.11319 2026-06-11 cs.LG cond-mat.dis-nn New submission

Learning from almost nothing: How neural networks survive heavy input corruption

Justin Tahmassebpur, Asadullah Bhuiyan, Hyejin Kim, Omri Lesser

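Affiliations * Cornell University

AI summary Studies why neural networks retain well-above-chance accuracy under heavy input corruption (>90%); a mean-field analysis shows the network implements a nearest-class-mean prototype rule, which explains why learning succeeds.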

Comments 26 pages, 10 figures

English abstract

Learning from imperfect data is a central theme in machine learning, connecting practical questions of robustness to fundamental questions of learnability. Here we examine attribute noise: learning from corrupted inputs while keeping the labels intact, a setting that has received considerably less analytical attention than its label-noise counterpart. We consider two types of corruption models: additive noise and replacement noise. Through experiments with multi-layer perceptrons (MLPs) on corrupted classification datasets, we find that neural networks remain robust, maintaining well-above-chance accuracy even when inputs are >90% corrupted -- far beyond human recognition. To understand this robustness, we analyze infinite-width networks in the heavy-corruption regime using a mean-field-inspired approach and derive a leading-order decision rule for the classification outcome: the network implements a prototype rule, the nearest-class-mean, assigning each test point to the class whose training-set average it most closely resembles. This leading-order decision rule is universal across a broad range of MLP architectures, holding for any depth, as well as a wide class of activation functions and noise distributions. The same centroid mechanism closely matches finite-width network behavior in our experiments and provides an interpretable and analytically tractable account of why learning can succeed even when individual training examples carry almost no signal.
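The nearest-class-mean rule named above can be written down directly. The sketch below is illustrative only: it implements the prototype rule itself, not the MLP experiments or the mean-field derivation, and the Gaussian distribution used for replacement noise is an assumption (the abstract does not fix one).

```python
import numpy as np

def replace_noise(x, frac, rng):
    """Replace a random fraction `frac` of entries with Gaussian noise (assumed noise law)."""
    mask = rng.random(x.shape) < frac
    return np.where(mask, rng.standard_normal(x.shape), x)

def fit_class_means(x_train, y_train, n_classes):
    """Average the (possibly corrupted) training inputs of each class."""
    return np.stack([x_train[y_train == c].mean(axis=0) for c in range(n_classes)])

def predict_nearest_mean(x, class_means):
    """Assign each test point to the class whose training mean is closest."""
    d2 = ((x[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

Averaging over many corrupted training examples lets the per-class signal accumulate while independent noise cancels, which is one way to read why a single example can carry almost no signal and learning can still succeed.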

2606.11316 2026-06-11 cs.CL New submission

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev

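Affiliations * Sofia University "St. Kliment Ohridski"; MBZUAI

AI summary To address the English- and Chinese-centric focus of existing safety evaluation datasets, builds Schützen, a safety dataset covering low-resource Bulgarian and high-resource German; experiments reveal pronounced cross-language differences in LLM safety behavior and underline the need for region-specific evaluation resources.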

Comments 19 pages, 13 tables, 12 figures

English abstract

Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts. To address this gap, we introduce Schützen: a German--Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria. Datasets and code are available at https://github.com/xnlp-lab/Schutzen. Warning: this paper contains examples that may be offensive, harmful, or biased.

2606.11314 2026-06-11 cs.CV cs.GR New submission

TRON: Tracing Rays to Orchestrate a Neural Renderer for 3D Gaussian Reconstructions

Or Perel, Hassan Abu Alhaija, Zian Wang, Jacob Munkberg, Matan Atzmon, Sanja Fidler, Masha Shugrina

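Affiliations * NVIDIA; University of Toronto; Vector Institute

AI summary Proposes TRON, a framework combining 3D Gaussian ray tracing with neural rendering for realistic, controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing; intrinsic decomposition priors and ray-traced radiometric guidance bridge physically based and neural rendering.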

Comments Project page: https://research.nvidia.com/labs/sil/projects/tron/

English abstract

We introduce TRON, a rendering framework that combines 3D Gaussian ray tracing with neural rendering to enable realistic and controllable rendering of real-world 3D scenes under novel lighting, dynamic object motion, object insertion, and material editing. Prior approaches that rely solely on physically based rendering (PBR) of Gaussian representations struggle to achieve realistic relighting due to imperfections in reconstructed geometry, material estimates, and light transport estimation. At the same time, neural rendering methods often lack an explicit scene representation, limiting their ability to support interactive editing with fine-grained manipulation. TRON bridges these two paradigms. We use intrinsic decomposition priors from a learned inverse rendering model to regularize the material properties of a Gaussian field, and repurpose a ray tracer to provide radiometric guidance rather than final pixels. By treating this output as a structured 3D scaffold, we empower a lightweight neural renderer to bridge the domain gap between shading-model constrained estimates and photorealistic output. Our key insight is that the combination of explicit 3D knowledge with robust material priors provides speed and controllability, while neural rendering enables the synthesis of photorealistic images. To support real-world scenarios, we train our neural renderer with a multi-stage strategy consisting of large-scale pretraining and targeted fine-tuning on a newly constructed dataset of 2.1M rendered synthetic and real-world frames from 3D reconstructions. TRON outperforms Gaussian-based relighting methods in realism, and prior neural renderers in editability and speed. To the best of our knowledge, TRON is the first method to enable practical interactive applications in captured 3D environments, offering realistic appearance under dynamic geometric, lighting and material conditions.

2606.11290 2026-06-11 cs.LG cs.AI cs.CL New submission

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang

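Affiliations * University of Maryland, College Park; Amazon

AI summary Proposes FlowBank, which precomputes diverse workflows, compresses them into a compact portfolio, and adaptively selects a workflow per query at inference time to balance performance and cost; it achieves the highest average score across five benchmarks while remaining cost-competitive.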

English abstract

Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.
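The precompute-and-reuse idea (compress a candidate pool into a small complementary portfolio, then route each query under a performance-cost trade-off) can be sketched with simple stand-ins. This is not DiverseFlow, CuraFlow, or the paper's edge-value predictor: greedy maximum coverage replaces the curation step, and the per-query utilities are assumed to be given rather than predicted. All names are hypothetical.

```python
def greedy_portfolio(solves, budget):
    """Pick up to `budget` workflows that together cover the most queries.

    solves: dict mapping workflow name -> set of query ids it solves.
    Greedy max-coverage; stops early once no workflow adds new queries.
    """
    covered, portfolio = set(), []
    for _ in range(budget):
        best = max(
            (w for w in solves if w not in portfolio),
            key=lambda w: len(solves[w] - covered),
            default=None,
        )
        if best is None or not (solves[best] - covered):
            break
        portfolio.append(best)
        covered |= solves[best]
    return portfolio

def route(utilities, costs, portfolio, cost_weight=0.1):
    """Send a query to the portfolio member with the best utility minus weighted cost."""
    return max(portfolio, key=lambda w: utilities[w] - cost_weight * costs[w])
```

Redundant workflows (those solving only queries already covered) are never selected, which mirrors the goal of a compact portfolio with minimal redundancy.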

2606.11289 2026-06-11 cs.CV New submission

i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Boya Zeng, Tianze Luo, Shu Pu, Jucheng Shen, Taiming Lu, Gabriel Sarch, Zhuang Liu

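Affiliations * Princeton University

AI summary Systematically studies design choices for text-to-image diffusion models through 300+ controlled experiments and presents i1, a 3B-parameter model trained only on public datasets that outperforms the best existing fully open model by 29.5 percentage points on average across five benchmarks.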

Comments Project page at https://zlab-princeton.github.io/i1

English abstract

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.

2606.11286 2026-06-11 cs.LG cs.AI New submission

FreeBridge: Variational Schrödinger Bridges for Cellular Transition Dynamics

Xurui Wang, Qin Ren, Jun Ma, Haibin Ling, Chenyu You

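Affiliations * Stony Brook University; University of Toronto; University Health Network

AI summary For cellular perturbation modeling in high-content imaging with endpoint-only supervision, proposes FreeBridge, a variational Schrödinger bridge that learns stochastic transport on a fixed cellular manifold and constrains intermediate paths with empirical latent support regularization, reducing intermediate support violations while preserving endpoint fidelity.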

Comments Accepted to MICCAI 2026 (early accept). Project page: https://y-research-sbu.github.io/FreeBridge/

English abstract

High-content imaging assays quantify cellular responses to chemical and genetic perturbations, yet continuous trajectories of individual cells are unobservable because cells are chemically fixed at acquisition. Perturbation modeling therefore reduces to inferring stochastic transport between control and treated populations observed only as separate marginals. While recent generative models achieve strong end-point alignment, boundary consistency does not determine intermediate evolution: multiple stochastic processes may connect identical marginals while traversing regions unsupported by observed single-cell morphologies. We introduce FreeBridge, a Schrödinger Bridge formulation for single-cell transition modeling under endpoint-only supervision. FreeBridge defines atomic states as instance-segmented single-cell representations, establishing a fixed cellular manifold, and learns stochastic transport constrained within this geometry via empirical latent support regularization. Across BBBC021, RxRx1, and JUMP, FreeBridge maintains competitive or improved endpoint fidelity and mechanism-of-action retention under a unified evaluation protocol; on BBBC021, it further reduces intermediate support violations. These findings highlight the importance of geometric grounding for biologically interpretable perturbation dynamics. Project page: https://y-research-sbu.github.io/FreeBridge/.

2606.11278 2026-06-11 cs.RO New submission

Model-based Optimization of Anguilliform Swimming Gaits for Soft Robotic Applications

Brian Van Stratum, James Gallentine, Caleb Rucker, Eric Barth, Jonathan E. Clark, Kourosh Shoele

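Affiliations * FAMU/FSU College of Engineering; Vanderbilt University; The University of Tennessee, Knoxville

AI summary Introduces the Soft Lamprey-Inspired Dual Environment Robot (SLIDER); combining Lighthill's theory, a nonlinear structural model, and a genetic algorithm, it co-optimizes the swimming control pattern and caudal fin design to reach a tethered swimming speed of 21.7 ± 0.4 cm/s, and explores optimization of multimodal robots.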

English abstract

In this paper, we introduce the Soft Lamprey-Inspired Dual Environment Robot (SLIDER) and the modeling and optimization procedure employed to design the robot. We represent the primary fluid environment actions (inertial effects, vortex forces, and viscous dissipation) using Lighthill's theory for large-amplitude elongated bodies. For structural design parameters such as internal pressure, tail size, and body stiffness, a fast, geometrically and materially nonlinear model is developed and validated. The fluid-structure interaction equations are solved implicitly with an efficient second-order box method. A pneumatic manifold robotic system is employed to actuate SLIDER in a quiescent water tank environment, allowing cross-comparison of computational and experimental results. We find that low-frequency swimming is dominated by resistive environmental forces, whereas higher-frequency swimming is primarily affected by inertial fluid forces. Using our efficient model alongside a genetic algorithm, we co-optimize a swimming control pattern and caudal fin design (subject to SLIDER's climbing morphology) to achieve a tethered swimming speed of 21.7 ± 0.4 cm/s (0.59 Bl/s). Furthermore, we investigate the optimization procedure for a multimodal robot performing both swimming and climbing tasks.

2606.11277 2026-06-11 cs.LG physics.comp-ph New submission

Least-Action-Guided Diffusion for Physical Extrapolation

Zhongxin Yang, Yuanwei Bin, Xiang I. A. Yang, Shiyi Chen

Affiliations * College of Engineering, Peking University; Ningbo Institute for Digital Twin, Eastern Institute of Technology; Eastern Institute for Advanced Study, Eastern Institute of Technology; Shenzhen Tenfong Technology Co., Ltd.; Mechanical Engineering, The Pennsylvania State University

AI summary Proposes least-action-guided diffusion (LAPG), a framework that turns the principle of least action into a differentiable inference-time correction mechanism, maintaining physical consistency in temporal, parameter, and geometric extrapolation and outperforming training-time physics-informed baselines.

English abstract

Reliable extrapolation remains a central challenge for generative models in computational physics, because models trained over finite ranges of time, parameters, or geometries may produce physically inconsistent predictions outside the training distribution. We introduce least-action-principle-guided diffusion (LAPG), a framework that promotes physical consistency during inference rather than relying solely on constraints imposed during training. The method combines a conditional score-based diffusion model with an action-derived physical guidance score. In the first stage, the learned score model generates an in-distribution proposal; in the second, an action-based variational prior refines this proposal toward the target out-of-distribution condition. This formulation turns the principle of least action into a differentiable inference-time correction mechanism and provides an alternative to pointwise residual penalties that often require empirical loss balancing. We evaluate LAPG on representative ordinary- and partial-differential-equation systems, including free fall, conservative and dissipative spring-mass dynamics, interacting point vortices, and potential flow over parameterized airfoils. In temporal, parameter, and geometric extrapolation tests, LAPG reduces phase drift, preserves dissipative decay, captures vortex motion, and improves the lift response of airfoil flows compared with training-time physics-informed baselines.
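For the free-fall case listed above, the principle of least action can be illustrated on its own with a discretized path. The sketch below is not LAPG and involves no diffusion model; it only solves for the path that makes a discrete action stationary, under an assumed trapezoidal discretization and fixed endpoints. The function name and arguments are hypothetical.

```python
import numpy as np

def least_action_free_fall(x0, x1, total_time, n_steps, g=9.81):
    """Height path x[0..n_steps] that makes the discrete action stationary (n_steps >= 2).

    Discrete action (mass m cancels):
      S = sum_i [0.5*m*((x[i+1]-x[i])/dt)**2 - m*g*0.5*(x[i]+x[i+1])] * dt
    Setting dS/dx[i] = 0 at interior points gives
      x[i+1] - 2*x[i] + x[i-1] = -g * dt**2.
    """
    dt = total_time / n_steps
    n_int = n_steps - 1
    # Tridiagonal second-difference matrix for the interior points.
    a = (
        np.diag(-2.0 * np.ones(n_int))
        + np.diag(np.ones(n_int - 1), 1)
        + np.diag(np.ones(n_int - 1), -1)
    )
    b = -g * dt ** 2 * np.ones(n_int)
    b[0] -= x0   # move the fixed start point to the right-hand side
    b[-1] -= x1  # move the fixed end point to the right-hand side
    interior = np.linalg.solve(a, b)
    return np.concatenate([[x0], interior, [x1]])
```

Because the second difference of a quadratic is exact, the stationary path coincides with the analytic parabola at the grid points. A guidance score in the spirit of the abstract would use the gradient of such an action rather than solving for its stationary point directly.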

2606.11275 2026-06-11 cs.LG cs.AI New submission

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers

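Affiliations * AMLab, University of Amsterdam; MIT CSAIL

AI summary Proposes RoVE, which makes values position-sensitive by rotating keys and values together, turning RoPE attention into attentive convolution; it outperforms RoPE on few-shot learning, out-of-distribution perplexity, and long-context retrieval.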

English abstract

Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a parameter-free modification that makes values position-sensitive by rotating them simultaneously with keys, and show that it turns RoPE attention into attentive convolution. This new perspective unifies several independent formulations of the same operation across computer vision, robotics, and modern LLM architectures. GPT-2 models trained at 124M and 354M parameters show consistent empirical gains over RoPE on few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval, with the clearest improvements on tasks that require long-range aggregation.
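One way to read "rotating values simultaneously with keys" is sketched below for a single attention head. This is an interpretation, not the paper's implementation: the final counter-rotation by the query position, which makes each value's contribution depend only on the relative offset, is an assumption, as are the function names and the absence of a causal mask.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (n, d) by position-dependent angles; d must be even."""
    d = x.shape[1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rotated_value_attention(q, k, v, positions):
    """Attention in which values are rotated like keys (assumed reading of RoVE)."""
    d = q.shape[1]
    q_rot, k_rot, v_rot = (rope_rotate(t, positions) for t in (q, k, v))
    scores = q_rot @ k_rot.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    mixed = weights @ v_rot
    # Assumed step: undo the query-position rotation so value j reaches
    # query i rotated by the relative offset (p_j - p_i) only.
    return rope_rotate(mixed, -positions)
```

Under this reading both the attention weights and the value messages depend only on relative positions, so shifting every position by the same constant leaves the output unchanged.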

2606.11272 2026-06-11 cs.LG cs.AI New submission

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

Masoume Gholizade, Fabrizio Ruffini, Pietro Ducange, Francesco Marcelloni

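Affiliations * University of Pisa; University of Modena and Reggio Emilia

AI summary Systematically surveys federated continual learning (FCL): defines the problem, analyzes the limitations of classical federated learning on non-stationary data, proposes a multi-dimensional taxonomy, and discusses applications, evaluation metrics, and open challenges.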

Comments 77 pages, 8 figures

Journal ref Neurocomputing, Volume 694, 2026, 133929

English abstract

Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings such as healthcare, industrial IoT (IIoT), cybersecurity, and smart cities, data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

2606.11270 2026-06-11 cs.LG cs.AI cs.CL New submission

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

量化语言模型蒸馏中的潜意识行为迁移比率

Uwe Konig, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary

Affiliations * University of Freiburg

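AI summary By steering teacher models at controlled strengths and distilling student models from them, the study quantifies subliminal behavioral transfer ratios and finds that transfer is robust but scales differently across models.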

AI Chinese abstract

旨在将良性行为迁移到学生模型的语言模型蒸馏,也可能迁移教师模型中存在的不良特征,这种现象称为潜意识学习。虽然定性证据支持该效应的存在,但其程度尚未被系统表征。本研究以不同的引导(steering)强度对两个教师模型(Llama-2-7B-Chat 和 Qwen2.5-7B-Instruct)进行引导,并仅使用良性数据蒸馏学生模型,从而量化潜意识行为迁移比率。以 GPT-4.1 作为评估器,在 100 个 JailbreakBench 提示上的评估结果表明,迁移是鲁棒的,但表现出不同的缩放行为。Llama-2 表现出一个尖锐的阈值(在 $\alpha = -0.15$ 之外 $\tau = \{0.25, 0.32\}$),而 Qwen2.5 表现出连续且更高水平的迁移($\tau$ 高达 $0.61$)。

English abstract

Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).

2606.11269 2026-06-11 cs.CV cs.HC New submission

Traits Run Deeper: Trait-Specific Asymmetric Fusion for Personality Assessment

特质更深:面向人格评估的特质特异性非对称融合

Jia Li, Qian Chen, Wei Wang, Xinyu Li, Zhenzhen Hu, Dongsheng Shao, Richang Hong, Meng Wang

Affiliations * Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province; Jianghuai Advanced Technology Center; Anhui Provincial Industry Innovation Center of Humanoid Robots; Anhui Provincial Key Laboratory of Humanoid Robots

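AI summary Proposes the Traits Run Deeper framework, which combines multimodal foundation representation, trait-specific asymmetric fusion, and distribution-calibrated regression modules to address differing modality preferences and label bias in personality assessment, reducing MSE by about 25% on the AVI Challenge 2026 validation set.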

AI Chinese abstract

人格评估旨在从语言、声音和面部线索等动态行为中推断稳定的人格特质。由于不同的人格维度通过不同的行为视角展现,建模特质特异性证据具有挑战性。然而,现有大多数方法对所有维度采用统一的多模态融合策略,假设模态贡献相同。这忽略了特质特异性的模态偏好,并引入了跨模态干扰。为解决这一问题,我们提出了一种新颖的人格评估框架,称为Traits Run Deeper,由三个组件组成。具体而言,多模态基础表示(MFR)模块构建面向人格的多模态输入,并利用心理学启发的语义模板作为锚点,使基础模型能够捕获特质相关信息。基于MFR,特质特异性模态融合(TSMF)模块作为一种非对称融合机制,允许每个维度从模态特定建模到互补融合中,选择性地利用不同的模态路径。因此,TSMF捕获了异质的模态偏好,同时减少了跨模态污染。此外,分布校准人格回归(DCPR)模块通过目标分布校准减轻了标签不平衡和集中趋势偏差,提高了鲁棒性和稳定性。在AVI Challenge 2026验证集上的实验结果表明了所提出框架的有效性,与基线相比,均方误差(MSE)降低了约25%。在官方测试集上观察到一致的改进,我们的方法取得了最佳性能,并在人格评估赛道中排名第一。源代码将在 https://github.com/MSA-LMC/AVI2026 提供。

English abstract

Personality assessment aims to infer stable personality traits from dynamic behaviors across language, voice, and facial cues. Since different personality dimensions are revealed through distinct behavioral perspectives, modeling trait-specific evidence is challenging. However, most existing approaches adopt a uniform multimodal fusion strategy across all dimensions, assuming identical modality contributions. This overlooks trait-specific modality preferences and introduces cross-modal interference. To address this issue, we propose a novel personality assessment framework called Traits Run Deeper, which consists of three components. Specifically, the Multimodal Foundation Representation (MFR) module constructs personality-oriented multimodal inputs and leverages psychology-informed semantic templates as anchors, enabling foundation models to capture trait-relevant information. Building upon MFR, the Trait-Specific Modality Fusion (TSMF) module acts as an asymmetric fusion mechanism, allowing each dimension to selectively exploit different modality pathways from modality-specific modeling to complementary fusion. Thus, TSMF captures heterogeneous modality preferences while reducing cross-modal contamination. Furthermore, the Distribution-Calibrated Personality Regression (DCPR) module mitigates label imbalance and central tendency bias through target distribution calibration, improving robustness and stability. Experimental results on the AVI Challenge 2026 validation set demonstrate the effectiveness of the proposed framework, reducing mean squared error (MSE) by approximately 25% compared with the baseline. Consistent improvements are observed on the official test set, where our method achieves the best performance and ranks first in the Personality Assessment Track. The source code will be made available at https://github.com/MSA-LMC/AVI2026.