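arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.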

2605.04847 2026-06-15 cs.LG cs.AI Version update

Quantile-Free Uncertainty Quantification in Graph Neural Networks

Soyoung Park, Hwanjun Song, Sungsu Lim

AI summary: Proposes the QpiGNN framework, which directly optimizes coverage and interval width through a quantile-free joint loss, giving efficient and robust uncertainty quantification for graph neural networks with theoretical guarantees of asymptotic coverage and near-optimal width.

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Uncertainty quantification (UQ) in graph neural networks (GNNs) is crucial in high-stakes domains but remains a significant challenge. In graph settings, message passing often relies on strong assumptions such as exchangeability, which are rarely satisfied in practice, and achieving reliable UQ typically requires costly resampling or post-hoc calibration. To address these issues, we introduce Quantile-free Prediction Interval GNN (QpiGNN), a framework that builds on quantile regression (QR) to enable GNN-based UQ by directly optimizing coverage and interval width without requiring quantile inputs or post-processing. QpiGNN employs a dual-head architecture that decouples prediction and uncertainty, and is trained with label-only supervision through a quantile-free joint loss. This design allows efficient training and yields robust prediction intervals, with theoretical guarantees of asymptotic coverage and near-optimal width under mild assumptions. Experiments on 19 synthetic and real-world benchmarks show that QpiGNN achieves on average 22% higher coverage and 50% narrower intervals than baselines, while ensuring efficiency and robustness to noise and structural shifts.
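
The two quantities the abstract says are optimized, coverage and interval width, can be illustrated with a minimal sketch. This is a generic illustration and not the paper's loss: the function names, the `target` and `width_weight` parameters, and the squared-shortfall penalty are all assumptions made here for the example.

```python
def interval_metrics(y_true, lower, upper):
    """Return (empirical coverage, mean width) of prediction intervals."""
    n = len(y_true)
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_width


def joint_objective(y_true, lower, upper, target=0.9, width_weight=0.1):
    """Hypothetical score: squared coverage shortfall plus weighted mean width."""
    coverage, mean_width = interval_metrics(y_true, lower, upper)
    shortfall = max(0.0, target - coverage)  # penalize only under-coverage
    return shortfall ** 2 + width_weight * mean_width


# Toy data: the third label falls outside its interval.
y = [1.0, 2.0, 3.0, 4.0]
lo = [0.5, 1.5, 3.5, 3.0]
hi = [1.5, 2.5, 4.5, 5.0]
coverage, width = interval_metrics(y, lo, hi)
score = joint_objective(y, lo, hi)
```

The hard inside/outside test used here is not differentiable, so a trainable version would need a smooth surrogate for coverage; the abstract does not say which one the authors use.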

2605.03065 2026-06-15 cs.LG cs.RO Version update

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

Sarvesh Patil, Mitsuhiko Nakamoto, Manan Agarwal, Shashwat Saxena, Jesse Zhang, Giri Anantharaman, Cleah Winston, Chaoyi Pan, Douglas Chen, Nai-Chieh Huang, Zeynep Temel, Oliver Kroemer, Sergey Levine, Abhishek Gupta, Hongkai Dai, Paarth Shah, Max Simchowitz

Affiliations: University of California, Berkeley

AI summary: Proposes OGPO, an algorithm that uses off-policy critic networks and a modified PPO objective for sample-efficient finetuning of generative control policies; it reaches state-of-the-art performance on a range of manipulation tasks and can finetune poorly initialized behavior cloning policies without expert data.

Abstract

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can finetune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and it does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.

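2606.06010 2026-06-15 cs.LG cs.DB Version update

Adaptive Oscillatory-State Alignment for Time Series Forecasting

Zhangyao Song, Chaofeng Qu, Chao Zha, Xiaoyu Zhao, Yinfei Xu, Tao Guo

Affiliations: University of Science and Technology of China

AI summary: Proposes the AOSNet framework, which uses the Hilbert transform to recast fixed template matching as adaptive oscillatory-state alignment, handling non-stationary oscillatory behavior in real-world time series and reaching state-of-the-art or competitive accuracy on multiple benchmarks.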

Abstract

Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or fixed learnable templates. However, real-world temporal dynamics are rarely rigidly periodic: around a nominal cycle, oscillatory behavior often exhibits non-rigid periodicity (NRP), where cycle magnitude, cycle alignment, and local cycle duration vary over time. Under these conditions, fixed-template periodic modeling can become fundamentally mismatched to the underlying temporal states. We propose AOSNet, a Hilbert-guided forecasting framework that reformulates periodic forecasting from fixed template matching to adaptive oscillatory-state alignment. AOSNet extracts analytic-signal descriptors from both the observed sequence and a learnable global oscillatory prior, then adaptively aligns local states through a descriptor-conditioned gate that selectively preserves reliable observations while softly correcting mismatched regions. The learned prior serves not as a rigid repeated template but as a flexible oscillatory reference interpreted through local state dynamics. Experiments on eight public benchmarks and two cloud workload traces demonstrate leading or highly competitive accuracy with a compact model size and low inference latency, supporting repeated forecasting settings such as capacity planning and autoscaling. Controlled synthetic studies that isolate cycle-magnitude and cycle-alignment variation and combine them with cycle-duration changes show that the advantage of oscillatory-state alignment increases as NRP intensifies.

2606.05774 2026-06-15 cs.CV Version update

LiAuto-GeoX: Efficient Grounded Driving Transformer

Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu, Siyuan Wang, Le Hui, Ning Mao, Tao Wei, Pan Zhou, Kun Zhan, Jian Yang

Affiliations: Nanjing University of Science and Technology; Li Auto Inc.; Northwestern Polytechnical University; Department of Computing, The Hong Kong Polytechnic University

AI summary: Proposes LiAuto-GeoX, which uses sparse LiDAR priors and a geometry-preserving distillation framework to achieve efficient, real-time, ego-centric dense 3D reconstruction and substantially improve performance on downstream autonomous-driving tasks.

Abstract

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. These all demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.

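2606.05461 2026-06-15 cs.AI Version update

Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

Abhinaw Priyadershi, Mandar Pitale, Jelena Frtunikj, Maria Spence

Affiliations: NVIDIA Corporation; NVIDIA GmbH

AI summary: Addresses the evidence-type gap between the evidence that safety standards for ML-based autonomous driving require and the output types of XAI methods; derives 19 testable evidentiary criteria from several safety standards, evaluates six XAI method classes, finds causal XAI structurally required at three lifecycle stages, and introduces the notion of structural admissibility.

Comments: Accepted at SAFECOMP 2026 Workshops (SASSUR); to appear in Springer LNCS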

Abstract

Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS XAI method, returns a ranked feature list that no implementation effort can convert into a directed chain (Fig. 1). We name this mismatch the evidence-type gap. From AMLAS, ISO 26262, ISO 21448, and ISO/PAS 8800 we derive 19 testable evidentiary criteria across 7 lifecycle stages with representative clause-cited derivations and score six XAI method classes structurally. Causal XAI emerges as structurally required to satisfy the derived criteria at three stages: hazard identification (+62% rubric gap), incident investigation (+50%), and data management (+50%); the verdict set is stable across thresholds T in (0%, 50%] and survives a worst-case single-cell flip down to T = 25%. At the remaining four stages, correlational or language-based methods are comparable or sufficient. The rubric identifies structural admissibility (necessary but not sufficient for compliance): an admissible method's specific output content may still be wrong, and validating that fidelity (the edges a fitted SCM produces, the cause a trace names) is the open assurance challenge. A single-VLA proof of concept on 1,996 real-world driving clips (79,840 rows, ten splits) is consistent with each method's observed output type matching its rubric prediction. XAI method selection for ADS safety assurance should be driven by lifecycle-stage evidence demand, not by method popularity.

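2606.05102 2026-06-15 cs.CV Version update

ZipSplat: Fewer Gaussians, Better Splats

Alexander Veicht, Sunghwan Hong, Dániel Baráth, Marc Pollefeys

Affiliations: ETH Zürich; Microsoft

AI summary: Proposes ZipSplat, a token-based feed-forward model that compresses visual tokens by clustering and decodes them into groups of Gaussians, offering a quality-efficiency trade-off without retraining and setting a new state of the art on DL3DV and RealEstate10K with about 6× fewer Gaussians.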

Abstract

Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ~6× fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1 dB and 1.2 dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
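
The token-compression step mentioned in the abstract, k-means over dense tokens, can be sketched with plain Lloyd iterations. This is a generic k-means on toy 2-D vectors, not ZipSplat's implementation; the function name `kmeans`, the toy `tokens`, and the choice of `k` are assumptions for illustration.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the previous center if a cluster ends up empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


# Six toy "dense tokens" forming two well-separated groups.
tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
scene_tokens = sorted(kmeans(tokens, k=2))
```

Since `k` is only an argument to the clustering call, changing it alters the number of compressed tokens without touching any trained weights, which is the property the abstract attributes to clustering at inference time.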

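2606.04883 2026-06-15 cs.CL cs.LO Version update

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

Kári Rögnvaldsson, Chenhao Sun, Jasper Dekoninck, Martin Vechev

Affiliations: University of Washington; University of California, Berkeley

AI summary: Proposes an action-routing agent with a data plane and a control plane that observes failed trajectories and estimates success probability and cost to decide dynamically whether to keep proving or to re-decompose, cutting cost by 25.8% on average on a subset of PutnamBench while preserving performance.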

Abstract

Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by 28.9% over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.

2606.04718 2026-06-15 cs.RO cs.AI Version update

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

Kailun Huang, Zikang Xie, Yanzhe Xie, Panpan Liao, Fanghai Zhang, Yanheng Mai, Wenhao Xu, Yunheng Wang, Renjing Xu, Haohui Huang, Chenguang Yang

Affiliations: Hong Kong University of Science and Technology (Guangzhou); South China Agricultural University; Guangdong University of Technology

AI summary: Proposes CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation and uses contrastive learning to promote expert specialization, enabling stable humanoid walking and running across multiple terrains.

Comments: Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu, Haohui Huang and Chenguang Yang

Abstract

Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.

2606.03108 2026-06-15 cs.AI Version update

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye

Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Tongyi Lab, Alibaba Group; Alibaba Group; SUAT

AI summary: Proposes the EvoTrainer framework, which co-evolves LLM policies and training-side harnesses, using empirical feedback to diagnose, revise, and accumulate reusable skills automatically; it matches or exceeds human-engineered RL baselines on mathematical reasoning, competitive programming, and repository-level software engineering tasks.

Abstract

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.

2606.03085 2026-06-15 cs.LG cs.CL Version update

Multi-component Causal Tracing in Large Language Models

Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer

Affiliations: Rensselaer Polytechnic Institute; IBM Research

AI summary: Proposes a unified framework that uses soft interventions and a metric transformation to efficiently identify the multi-component subsets most critical to a target performance metric, outperforming existing baselines.

Comments: Accepted to ACL 2026 main conference

Abstract

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

2606.02320 2026-06-15 cs.CL Version update

TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

Affiliations: Nanjing University; Alibaba Group

AI summary: Proposes the TVIR benchmark and a hierarchical multi-agent framework to address the factual reliability and alignment of visual elements in deep research reports.

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.

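2606.01730 2026-06-15 cs.AI cs.LG Version update

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

Jiangyu Chen, Ban Yi

Affiliations: State Key Laboratory for Novel Software Technology

AI summary: To address potentially misleading LLM priors in multi-objective Bayesian optimization, proposes an objective-wise reputation-market mechanism that calibrates expert weights dynamically from online feedback, adds a decoupled counterfactual gate, and validates the robustness of dynamic calibration on synthetic tests and molecule optimization benchmarks.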

Abstract

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.
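
The idea of per-objective expert weights that are updated online and discounted over time can be sketched with a generic discounted exponential-weights scheme. This is an assumption-laden illustration, not the paper's reputation-market mechanism: the function names, the `discount` value, the softmax normalization, and the expert and objective labels are all hypothetical.

```python
import math


def update_scores(scores, errors, discount=0.9):
    """Decay every (expert, objective) score, then subtract the new error."""
    updated = {key: discount * value for key, value in scores.items()}
    for key, err in errors.items():
        updated[key] = updated.get(key, 0.0) - err
    return updated


def weights_for_objective(scores, objective):
    """Softmax over the experts' scores for one objective."""
    keys = [k for k in scores if k[1] == objective]
    top = max(scores[k] for k in keys)  # subtract the max for numerical stability
    exps = {k: math.exp(scores[k] - top) for k in keys}
    total = sum(exps.values())
    return {k[0]: v / total for k, v in exps.items()}


scores = {}
# Two rounds of hypothetical prediction errors for one objective.
scores = update_scores(scores, {("expert_a", "solubility"): 0.1,
                                ("expert_b", "solubility"): 0.9})
scores = update_scores(scores, {("expert_a", "solubility"): 0.2,
                                ("expert_b", "solubility"): 0.8})
weights = weights_for_objective(scores, "solubility")
```

Keying the scores by (expert, objective) pairs lets the same expert be trusted for one objective and down-weighted for another, which is the behavior the abstract motivates.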

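2606.01476 2026-06-15 cs.LG cs.CL Version update

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Affiliations: Meta AI

AI summary: Proposes the OmniOPD framework, which replaces token-level logit matching with chunk-level semantic similarity based on Monte Carlo rollouts, combined with a peak-entropy scheduler and a Bayesian prior, to address unavailable logits and brittle signals in on-policy distillation; it surpasses standard OPD by up to 28.64% on math tasks.

Comments: 26 pages, 3 figures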

Abstract

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.

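2606.00947 2026-06-15 cs.LG cs.AI Version update

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini

Affiliations: Hebrew University of Jerusalem

AI summary: Proposes a principled rule for dynamic abstention within a regularized reinforcement learning framework, deciding whether to terminate reasoning early by comparing the value function with an abstention reward; it outperforms existing methods on mathematical reasoning and toxicity avoidance tasks.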

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)

English abstract

LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
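As a rough illustration of the threshold rule stated in the abstract (abstain once the value function falls below the abstention reward), the sketch below scans a trace and reports the first position where that happens. It assumes the per-token value estimates are already available as a list; how they are approximated is the paper's contribution and is not reproduced here. The function name and the sample numbers are hypothetical.

```python
def first_abstention_step(value_estimates, abstention_reward):
    """Return the first token index whose estimated value is below the
    abstention reward, or None if generation should run to completion."""
    for t, value in enumerate(value_estimates):
        if value < abstention_reward:
            return t
    return None

# Hypothetical per-token value estimates for one reasoning trace.
values = [0.62, 0.58, 0.41, 0.18, 0.09]
stop_at = first_abstention_step(values, abstention_reward=0.2)
```

Raising `abstention_reward` makes the rule stop earlier, which saves compute but withholds more answers; lowering it does the opposite.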

2602.08324 2026-06-15 cs.LG Updated version

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin

Affiliations: University of Science and Technology of China

AI summary: Proposes the Extra-CoT framework, which combines extreme-ratio chain-of-thought compression, mixed-ratio supervised fine-tuning, and Constrained and Hierarchical Ratio Policy Optimization to substantially reduce reasoning tokens while maintaining or even improving reasoning accuracy.

Comments Accepted to ICML 2026. 15 pages, 7 figures

English abstract

Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead at inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantics-preserving compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets through a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods. Our source code has been released at https://github.com/Mwie1024/Extra-CoT.

2406.09250 2026-06-15 cs.CV cs.AI cs.LG Updated version

MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Ivan Laptev, Karthik Nandakumar

Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; NVIDIA; École Polytechnique Fédérale de Lausanne; Michigan State University

AI summary: Proposes the MirrorCheck framework, which uses text-to-image models and a randomization strategy to detect and defend against adaptive adversarial attacks on vision-language models.

English abstract

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.
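The consistency check described above can be sketched in a few lines, under strong simplifying assumptions. The sketch treats embeddings as plain lists of floats, picks a generator and an encoder uniformly at random from a model zoo, and models the One-Time-Use perturbation as a fresh Gaussian noise vector, scaled by `scale`, added to both embeddings before comparing them. That form of the perturbation, the decision threshold, and all function names are hypothetical and not taken from the paper.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def choose_models(generators, encoders, rng):
    """Randomly pick one T2I generator and one image encoder from the zoo."""
    return rng.choice(generators), rng.choice(encoders)

def perturbed_similarity(original_emb, regenerated_emb, scale, rng):
    """Similarity after adding one fresh noise vector to both embeddings.

    The noise is drawn once per query and then discarded, which is an
    assumed reading of the One-Time-Use perturbation.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in original_emb]
    a = [x + scale * n for x, n in zip(original_emb, noise)]
    b = [y + scale * n for y, n in zip(regenerated_emb, noise)]
    return cosine_similarity(a, b)

def is_adversarial(original_emb, regenerated_emb, threshold, scale, rng):
    """Flag the input when the two embeddings disagree too much."""
    return perturbed_similarity(original_emb, regenerated_emb, scale, rng) < threshold

rng = random.Random(0)
generator, encoder = choose_models(["t2i_a", "t2i_b"], ["enc_a", "enc_b"], rng)
clean_flag = is_adversarial([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], 0.8, 0.0, rng)
attack_flag = is_adversarial([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.8, 0.0, rng)
```

With `scale` set to zero the check reduces to a plain cosine-similarity test; a nonzero scale makes the score vary from query to query, which is the property the abstract credits with weakening adaptive attacks.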

2605.25025 2026-06-15 cs.RO cs.SY eess.SY Updated version

Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Josef Berman, Oren Gal

Affiliations: Hatter Department of Marine Technologies, Leon H. Charney School of Marine Sciences, University of Haifa

AI summary: Proposes a hybrid framework combining CFD with multi-objective multi-agent reinforcement learning, using PCGrad to resolve gradient conflicts, to optimize the upstream progression, energy efficiency, and motion smoothness of micro-robot swarms in oscillatory flow.

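AI-generated abstract (translated from Chinese)

Coordinating micro-robot swarms in physiologically realistic, time-dependent fluid environments remains an open challenge in biomedical and environmental applications. We propose a hybrid computational fluid dynamics and multi-objective multi-agent reinforcement learning framework that directly couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn physically consistent swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots navigate a pulsatile arterial waveform while simultaneously optimizing upstream progression, energy conservation, and motion smoothness, with the objectives reconciled through PCGrad surgery. Without PCGrad, the energy-efficiency and smoothness rewards fall to near zero within 10,000 training steps while progress shows sustained large oscillations, confirming that gradient conflict resolution is a structural requirement in this domain rather than an optional refinement. The converged policy achieves a progress reward of 6.5-7.0, sustained energy efficiency of 0.63-0.65, and near-maximal smoothness (0.97-0.99), improving on the brute-force baselines on the primary objective, whereas both baselines have negative energy efficiency throughout. Training reveals three emergent behavioral phases: a collective double-layer hydrodynamic throttling formation that suppresses peak channel velocity during forward flow, a cycle-synchronized ratchet mechanism that exploits flow reversal for upstream repositioning, and individualized final approaches as agents near the success boundary. These results show that time-dependent fluid-agent interactions can be captured directly within a multi-objective reinforcement learning loop, providing a physics-based paradigm for micro-swarm control in biomedical navigation, environmental monitoring, and industrial microfluidics.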

English abstract

Coordinating micro-robotic swarms in realistic, time-dependent fluid environments remains a major challenge for biomedical and environmental applications. We present a hybrid CFD-MO-MARL (Computational Fluid Dynamics-Multi Objective-Multi Agent Reinforcement Learning) framework that couples a high-fidelity incompressible Navier-Stokes solver with decentralized proximal policy optimization to learn swarm control policies in oscillatory flow. Sixteen magnetically actuated micro-robots were simulated to navigate a pulsatile arterial waveform within a 2 mm channel while jointly optimizing upstream progression, energy efficiency, and motion smoothness. Conflicting objectives are resolved using Projected Conflicting Gradient (PCGrad) surgery. Without PCGrad, energy and smoothness rewards collapse during training, demonstrating that gradient conflict resolution is essential for stable multi-objective learning. The converged policy achieves progress rewards of 6.5-7.0, energy efficiency of 0.63-0.65, and smoothness of 0.97-0.99, outperforming brute-force baselines by more than 8 reward units on the primary objective. Training reveals three emergent behaviors not encoded in the reward function: hydrodynamic throttling formations that reduce peak flow velocities, a cycle-synchronized ratchet mechanism that exploits flow reversals for upstream movement, and individualized final-approach strategies near the target boundary. These results demonstrate that physically realistic fluid-agent interactions can be integrated directly into multi-objective reinforcement learning, providing a scalable framework for micro-swarm control in biomedical navigation, environmental monitoring, and microfluidic systems.
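For readers unfamiliar with PCGrad, the projection step it relies on can be sketched as follows. This is a generic pure-Python illustration of the published PCGrad rule (project a gradient onto the normal plane of any other objective's gradient it conflicts with, then sum), not the authors' training code. Gradients are plain lists of floats, and the helper names are hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pcgrad(task_grads, rng=None):
    """Combine per-objective gradients with PCGrad-style projection.

    For each objective i, visit the other objectives in random order and,
    whenever the running gradient conflicts with g_j (negative inner
    product), remove its component along g_j. The projected gradients are
    then summed into a single update direction.
    """
    rng = rng or random.Random(0)
    projected = []
    for i, g in enumerate(task_grads):
        g_i = list(g)
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g_j = task_grads[j]
            inner = dot(g_i, g_j)
            if inner < 0.0:  # conflict, so g_j is nonzero here
                coeff = inner / dot(g_j, g_j)
                g_i = [a - coeff * b for a, b in zip(g_i, g_j)]
        projected.append(g_i)
    return [sum(column) for column in zip(*projected)]

# Two conflicting toy gradients (for example, progress versus energy).
combined = pcgrad([[1.0, 0.0], [-1.0, 1.0]])
```

In the toy call the two gradients have a negative inner product, so each is projected before summation; for gradients that do not conflict the function reduces to a plain sum.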

2604.26740 2026-06-15 cs.CV cs.GR Updated version

Rendering-Aware Sparse Sampling for BRDF Acquisition

W. Cao, D. Jönsson, Z. Huang, J. Unger

Affiliations: Media and Information Technology, Department of Science and Technology, Linköping University

AI summary: Proposes a rendering-aware sparse sampling method that optimizes sampling directions through a differentiable renderer to reconstruct high-quality material appearance from a minimal number of BRDF measurements.

English abstract

Accurate BRDF acquisition is essential for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small set of BRDF measurements that is most informative for reconstructing material appearance under a learned BRDF prior. Existing sparse-acquisition methods often optimize samples for BRDF-space reconstruction across all materials, while the perceptual importance of an adaptive measurement ultimately depends on its effect on each rendered appearance. We therefore formulate sparse adaptive acquisition as a rendering-aware optimization problem. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based or PCA-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor remains fixed, and gradients from a rendered-image loss optimize the measurement locations. This separates acquisition design from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. To make the comparison controlled, we evaluate a uniform baseline, a meta-learning method, the HyperBRDF method, and our learned sampler under matched sample numbers, train/test split, rendering scene, object mask, image mapping, and metrics. Our central claim is that rendering-aware sampling improves extremely sparse BRDF acquisition when the final rendered appearance is the target. BRDF-space and combined losses are reported only as ablations, together with joint refinement and image-only latent fitting for unseen materials.

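2601.05106 2026-06-15 cs.AI cs.CL cs.LG Updated version

Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

AI summary: This paper proposes the FusionRoute framework, in which a lightweight router selects the most suitable expert at each decoding step and contributes complementary logits to refine the next-token distribution. It addresses the difficulty of achieving strong performance across multiple domains with a single general-purpose model and outperforms other methods on several benchmarks.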

Comments 25 pages

English abstract

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
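The two roles of the router described above (expert selection plus logit addition) can be illustrated with a toy decoding step. The sketch below is an assumption-laden simplification, not the FusionRoute implementation: router scores and complementary logits are passed in as plain lists rather than produced by a trained network, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_next_token_probs(expert_logits, router_scores, complementary_logits):
    """Pick the expert with the highest router score, then add the router's
    complementary logits to that expert's logits before the softmax."""
    chosen = max(range(len(router_scores)), key=lambda i: router_scores[i])
    fused = [e + c for e, c in zip(expert_logits[chosen], complementary_logits)]
    return chosen, softmax(fused)

# Two toy experts over a three-token vocabulary.
expert_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
chosen, probs = fused_next_token_probs(
    expert_logits,
    router_scores=[0.3, 0.7],
    complementary_logits=[0.0, 0.0, 2.0],
)
```

Here the selected expert alone would prefer token 1, while the added logits shift the most likely token to index 2, which is the kind of correction that pure expert-only routing cannot express.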

2605.21472 2026-06-15 cs.CV Updated version

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Kaichen Zhou, Zeyang Bai, Xinhai Chang, Mengyu Wang, Paul Liang, Fangneng Zhan

Affiliations: World Mind Lab, HKUST; Media Lab and EECS, MIT; Kempner Institute, Harvard University

AI summary: Proposes Stream3D, a training-free streaming mechanism that maintains a compact evidential memory caching key historical frames, converting a frozen view-conditioned 3D generator into a streaming generator and addressing the temporal inconsistency of 3D generation from monocular video streams.

Comments Multi-view 3D Generation, Streaming 3D Generation

English abstract

View-conditioned 3D generators such as SAM 3D, TRELLIS, and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://stream-3d.github.io/stream3d.github.io/.
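The constant-size memory described above behaves like a bounded top-k buffer, which can be sketched with a min-heap. This is only an illustration of the bookkeeping: the evidence score is treated as a number supplied by the caller, since the abstract does not define how it is computed, and the class and method names are hypothetical.

```python
import heapq

class EvidentialMemory:
    """Keep at most `capacity` frames, preferring higher evidence scores."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._heap = []  # min-heap of (score, frame_index)

    def update(self, frame_index, score):
        """Offer a new frame; evict the weakest entry if the new one beats it."""
        entry = (score, frame_index)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Ties on score are broken in favor of the later frame index.
            heapq.heapreplace(self._heap, entry)

    def frames(self):
        """Return the retained frame indices in temporal order."""
        return sorted(index for _, index in self._heap)

memory = EvidentialMemory(capacity=3)
for index, score in enumerate([0.2, 0.9, 0.1, 0.7, 0.8, 0.3]):
    memory.update(index, score)
retained = memory.frames()
```

Each update costs O(log capacity) and the buffer never grows beyond `capacity`, which mirrors the abstract's point that the memory footprint stays fixed regardless of stream length.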
