arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17601 2026-06-17 cs.CV New submission

Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting

Hao-Yuan Ma, Yuda Zou, Li Zhang, Yongchao Xu

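Affiliations * School of Computer Science and Technology, Soochow University; Wuhan University

AI summary Proposes the Dual-TTT framework, which trains a lightweight denoising module at test time to make text-guided open-vocabulary object counting more robust under adverse conditions, without modifying the original model architecture.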

Abstract

Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.

2606.17598 2026-06-17 cs.RO cs.CV New submission

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang, Lin Luo, Shiqi Jiang, Chenren Xu, Jiaolong Yang, Baining Guo

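Affiliations * School of Computer Science, Peking University; Microsoft Research Asia; Princeton University; Tsinghua University

AI summary Proposes MuseVLA, which integrates sensors as on-demand tools to achieve adaptive multimodal sensing; designs a unified sensor-image representation and introduces a data synthesis pipeline. On dexterous hand manipulation tasks it reaches an average success rate of 80.6%, significantly outperforming RGB-only and multisensory baselines.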

Abstract

Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.

2606.17591 2026-06-17 cs.AI New submission

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

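Affiliations * AWS Generative AI Innovation Center; Amazon Web Services (AWS); BingX Group Limited

AI summary Addresses the retention-forgetting dilemma of LLM agents in non-stationary environments with a three-layer architecture (rules, evidence, skills) that governs insights through a feedback-driven curation loop; a financial forecasting case study shows that the approach significantly improves accuracy and risk-adjusted returns.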

Comments Accepted to the ICML 2026 RLxF: Reinforcement Learning from World Feedback Workshop, RLxF@ICML 2026, Seoul, South Korea

Abstract

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.
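
The three layers named in this abstract can be pictured with a small sketch. It is a hypothetical illustration rather than the authors' implementation: the `Rule` class, the five-episode reliability window, and the 0.6 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Rule layer: a verbal rule distilled from experience (hypothetical class)."""
    text: str
    # Evidence layer: one entry per episode, 1 = the rule helped, 0 = it hurt.
    outcomes: list = field(default_factory=list)

    def reliability(self, window: int = 5) -> float:
        """Hit rate over the most recent `window` episodes; 0.5 when untested."""
        recent = self.outcomes[-window:]
        return sum(recent) / len(recent) if recent else 0.5


def select_rules(rules, threshold: float = 0.6):
    """Skill layer: apply only rules whose recent evidence clears the threshold.

    Rules below the threshold are skipped, not deleted, so they can return
    if conditions recur. An empty result means the agent abstains.
    """
    return [rule for rule in rules if rule.reliability() >= threshold]


momentum = Rule("Follow the prior week's trend.", outcomes=[1, 1, 0, 1, 1])
reversal = Rule("Fade sharp one-day moves.", outcomes=[1, 0, 0, 0, 1])
active = select_rules([momentum, reversal])
print([rule.text for rule in active])
```

Keeping the evidence log on each rule, instead of discarding weak rules, is one simple way to avoid both negative transfer and forgetting; the paper's actual curation loop may work differently.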

2606.17590 2026-06-17 cs.CV New submission

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

Weiliang Chen, Yuanhui Huang, Xuebo Wang, Yueqi Duan

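Affiliations * Department of Electronic Engineering, Tsinghua University; Department of Automation, Tsinghua University; Kuaishou Technology

AI summary Proposes TivTok, a reuse-aware video tokenizer that factorizes video into time-invariant (TIV) and time-variant (TV) tokens for efficient compression and long-video modeling, reaching an rFVD of 12.65 on the standard benchmark and a 2.91× gain in compression efficiency.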

Abstract

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard 16×256×256 benchmark and improves compression efficiency by 2.91× for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.
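
The two attention scopes described for SIF can be written down as a boolean mask. The sketch below is only an illustration under assumed conventions: the function name `scope_mask`, the key layout (TIV tokens first, then the patches of each frame), and the choice to let TIV queries also see the TIV tokens are all hypothetical and not taken from the paper.

```python
def scope_mask(num_tiv, num_frames, tv_per_frame, patches_per_frame):
    """Return mask[q][k], True when latent query q may attend to key k.

    Assumed key layout: [TIV tokens | patches of frame 0 | patches of frame 1 | ...].
    Assumed query layout: [TIV tokens | TV tokens of frame 0 | TV tokens of frame 1 | ...].
    """
    num_keys = num_tiv + num_frames * patches_per_frame
    mask = []
    # TIV queries see every key: the TIV tokens plus the patches of the whole clip.
    for _ in range(num_tiv):
        mask.append([True] * num_keys)
    # TV queries see the TIV tokens and only the patches of their own frame.
    for frame in range(num_frames):
        start = num_tiv + frame * patches_per_frame
        row = [k < num_tiv or start <= k < start + patches_per_frame
               for k in range(num_keys)]
        for _ in range(tv_per_frame):
            mask.append(list(row))
    return mask


mask = scope_mask(num_tiv=2, num_frames=3, tv_per_frame=1, patches_per_frame=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one latent query; the TV rows show a block that shifts frame by frame, while the TIV rows are fully open.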

2606.17584 2026-06-17 cs.CV cs.LG New submission

Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

Semin Kim, Jihwan Yoon, Seunghoon Hong

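Affiliations * KAIST (Korea Advanced Institute of Science and Technology)

AI summary Proposes SelFix, which achieves exact inversion in rectified flows by selecting the fixed-point solutions that make the inverse trajectory straighter, improving image reconstruction and editing quality.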

Abstract

Finding the initial noise that generates a given data sample, known as inversion, is a key component for downstream applications such as training-free image editing. Existing fixed-point inversion methods improve inversion accuracy by formulating each inversion step as a fixed-point problem, but they lack a principled mechanism for selecting among multiple fixed-point solutions that can arise in practice. We observe that different selections induce different inversion trajectories, leading to substantial variation in reconstruction and editing quality. For rectified flows, we further find that this variation is closely associated with trajectory straightness, motivating straightness as a principled selection criterion. We propose SelFix, a fixed-point inversion method that selects fixed-point solutions inducing straighter inverse trajectories while retaining convergence to an exact inverse root under standard local assumptions. Experiments on FLUX.1-dev and PIE-Bench show that SelFix improves fixed-point inversion, achieving stronger real-image reconstruction and better source-preserving prompt-based editing than prior inversion baselines. The code is available at https://github.com/seminkim/selfix.
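
To make "formulating each inversion step as a fixed-point problem" concrete, the sketch below inverts a single Euler step of a toy one-dimensional flow. It is a generic illustration, not SelFix: the `velocity` function is a hypothetical stand-in for a learned velocity field, the time convention is an assumption, and the toy problem has only one root, so no root selection is involved.

```python
import math


def velocity(x: float, t: float) -> float:
    """Toy stand-in for a learned velocity field (hypothetical)."""
    return math.sin(x) + t


def euler_step(x: float, t: float, h: float) -> float:
    """One forward Euler step of dx/dt = velocity(x, t)."""
    return x + h * velocity(x, t)


def invert_step(x_next: float, t: float, h: float, iters: int = 50) -> float:
    """Recover x from x_next by iterating x <- x_next - h * velocity(x, t).

    The iteration contracts when h times the Lipschitz constant of the
    velocity is below 1, which holds here because |cos(x)| <= 1 and h = 0.1.
    """
    x = x_next  # initial guess
    for _ in range(iters):
        x = x_next - h * velocity(x, t)
    return x


x0, t, h = 0.3, 0.0, 0.1
x1 = euler_step(x0, t, h)
recovered = invert_step(x1, t, h)
print(abs(recovered - x0))
```

With a real network the map need not be a contraction everywhere, which is where several fixed points can appear and a selection rule such as the straightness criterion becomes relevant.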

2606.17577 2026-06-17 cs.AI New submission

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

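Affiliations * Honda Motor Co., Ltd.

AI summary Proposes the first foundation-model-orchestrated workflow for crash safety design, integrating a surrogate model, multiobjective evolutionary search, a geometry generator, and a natural-language interface, and cutting pedestrian protection evaluation time from hours to seconds.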

Journal ref
ICLR 2026 Workshop The 2nd Workshop on Foundation Models for Science

Abstract

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation-model-orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average R² = 0.87 and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision-language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.
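
The "distribution-free conformal prediction intervals" in component (1) can be illustrated with the standard split-conformal recipe. This is a minimal sketch under assumptions: the abstract does not say which conformal variant the authors use, and the residuals, the point prediction, and the function name `conformal_halfwidth` are made up for the example.

```python
import math


def conformal_halfwidth(calibration_residuals, alpha: float = 0.1) -> float:
    """Split-conformal interval half-width targeting 1 - alpha coverage.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual
    from a held-out calibration set.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical surrogate errors (prediction minus simulation) on calibration runs.
residuals = [0.2, -0.5, 0.1, 0.4, -0.3, 0.6, -0.1, 0.3, -0.2]
q = conformal_halfwidth(residuals, alpha=0.25)
prediction = 3.0  # hypothetical surrogate output for a new design
interval = (prediction - q, prediction + q)
print(interval)
```

The interval needs no assumption about the error distribution beyond exchangeability of calibration and test points, which is what "distribution-free" refers to.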

2606.17574 2026-06-17 cs.AI New submission

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen

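Affiliations * Xiaopeng (XPeng Motors)

AI summary Proposes DeepInsight, an infrastructure that evaluates the full spectrum of the Physical AI stack on a single runtime, preserving heterogeneity behind three abstractions (task, resource, result) and enabling diagnosis of cross-layer regressions.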

Abstract

Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.

2606.17567 2026-06-17 cs.LG New submission

Reducing Learner Redundancy in Boosting via Residual Orthogonalization

Ye Su, Jipeng Guo, Yong Liu, Xin Xu, Gangchun Zhang, Jinxin Chen, Di Wu, Longlong Zhao

Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; College of Information Science and Technology, Beijing University of Chemical Technology; Gaoling School of Artificial Intelligence, Renmin University of China; School of Computer Science, Central China Normal University; School of Computing, Engineering and Mathematical Sciences, La Trobe University

AI summary Targets the learner redundancy caused by residual fitting in boosting with SCBoost, which reduces redundancy through two mechanisms, Spectral Residual Projection and Covariance-Regularized Weighting; its geometric properties are proven theoretically, and experiments show strong accuracy and F1 scores.

Abstract

While sequential residual fitting is the bedrock of standard boosting frameworks, it inherently breeds learner redundancy by repeatedly revisiting correlated error components. To address this bottleneck, we propose a shift from residual fitting to residual orthogonalization and introduce SCBoost. Our framework tackles redundancy through two complementary mechanisms: Spectral Residual Projection (SRP) and Covariance-Regularized Weighting (CRW). During training, SRP projects each residual target onto the orthogonal complement of the historical prediction subspace, forcing successive learners to capture only novel empirical innovations. During aggregation, CRW optimizes ensemble weights on a validation set with an explicit covariance penalty to mitigate remaining correlations. Theoretically, we provide a finite-sample geometric characterization proving that SRP yields an exact additive residual-energy decomposition. Furthermore, under an isotropic-noise assumption, we rigorously establish the conditions under which this projection improves the effective Signal-to-Noise Ratio. Extensive experiments across ten benchmark datasets demonstrate that SCBoost delivers strong out-of-the-box performance, particularly in accuracy and F1 score. This work reinterprets boosting through a geometric lens, suggesting that explicit redundancy control is a principled and necessary step toward more efficient ensemble architectures.
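
Projecting a residual onto the orthogonal complement of the span of earlier predictions, as SRP is described above, can be sketched with plain Gram-Schmidt. This is a generic linear-algebra illustration, not the SCBoost code: the helper names and the toy vectors are assumptions, and the paper's "spectral" construction may build the subspace differently.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(residual, past_predictions):
    """Split residual into (part inside the span of past predictions, orthogonal part)."""
    inside = [0.0] * len(residual)
    for b in orthonormal_basis(past_predictions):
        c = dot(residual, b)
        inside = [i + c * bi for i, bi in zip(inside, b)]
    outside = [r - i for r, i in zip(residual, inside)]
    return inside, outside


# Toy data: prediction vectors of two earlier learners on four training samples.
past = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
residual = [2.0, -1.0, 3.0, 0.5]
inside, outside = project_out(residual, past)
print(outside)
print(dot(residual, residual), dot(inside, inside) + dot(outside, outside))
```

The second printed line checks the Pythagorean identity: the residual's squared norm splits exactly into the part earlier learners could already explain and the orthogonal part, which is the kind of additive energy decomposition the abstract mentions.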

2606.17564 2026-06-17 cs.CV cs.AI New submission

Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

Qiyan Luo, Jie Yang, Yingdong Pi, Lekang Wen, Mi Wang

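Affiliations * Hubei Province Key Research and Development Program; LIESMARS Special Research Funding; National Science Fund for Distinguished Young Scholars

AI summary Because conventional 2D global matching is misleading in satellite multi-view reconstruction, proposes a geometry-faithful evaluation protocol based on the Rational Function Model (RFM); using an RPC-projected 3D consistency metric and a geometry-constrained dense matching proxy, it reveals the decoupling of semantic agreement from geometric localization and shows that 2D backbones remain competitive under RPC-consistent evaluation.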

Comments Accepted as an oral presentation at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Abstract

Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.

2606.17561 2026-06-17 cs.CV New submission

RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

Hao-Yuan Ma, Li Zhang, Zhiwei Zhu, Jie Gao

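Affiliations * School of Computer Science and Technology, Soochow University

AI summary Proposes RT-Counter, a real-time text-guided open-vocabulary counting framework that uses a Visual Prototype Textualization module and Weaving Transformer layers to achieve real-time inference while maintaining high accuracy, with an MAE of 13.30 on FSC147 at 112.48 FPS.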

Abstract

Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model's counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4× faster and over 4× more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.

2606.17557 2026-06-17 cs.CV New submission

Universal Image Restoration via Internalized Chain-of-Thought Reasoning

Yu Guo, Zhengru Fang, Shengfeng He, Senkang Hu, Yihang Tao, Phone Lin, Yuguang Fang

Affiliations * Hong Kong JC Lab of Smart City and Department of Computer Science, City University of Hong Kong; School of Computing and Information Systems, Singapore Management University; Computer Science and Information Engineering, National Taiwan University

AI summary Proposes the CoTIR framework, which internalizes chain-of-thought reasoning within a single pre-trained editing model and uses a differentiable, Lagrangian-inspired formulation to achieve universal image restoration under mixed degradations, outperforming existing methods on a 5.2M-sample benchmark.

Abstract

Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at https://github.com/gy65896/CoTIR.

2606.17553 2026-06-17 cs.LG New submission

SpatioTemporal Causal Network Diagnostics for Geographic Tipping Point Early Warning

Zhaoyuan Yu, Zhangyong Liang

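Affiliations * Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application; National Center for Applied Mathematics, Tianjin University

AI summary Proposes the SpatioTemporal Causal Network Diagnostics (ST-CND) framework, which builds a data-driven directed causal network and combines local recovery-rate estimation with vulnerable-subnetwork identification to address spatial dilution, Euclidean assumptions, and correlated noise in geographic tipping point early warning, reaching an AUROC of 0.783 on the AMOC task.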

Abstract

Geographic tipping points in ecosystems, climate subsystems, or ice sheets pose severe challenges for localized early warning. Classical spatial indicators such as Moran's I summarize global spatial structure, but they struggle with three issues: spatial dilution, Euclidean assumptions, and correlated noise. This paper introduces SpatioTemporal Causal Network Diagnostics (ST-CND), a framework that addresses these three issues by representing the geographic field as a time-evolving directed causal network. The core workflow is as follows: (1) infer which spatial nodes help predict other nodes via transfer entropy, replacing fixed Euclidean neighborhoods with data-driven information-flow topology; (2) estimate local recovery rates within each candidate subnetwork via dynamic mode decomposition; and (3) identify the most vulnerable subnetwork by combining three signals, namely high internal fluctuation, high internal synchronization, and low external coupling, thereby suppressing false alarms from spatially correlated noise. Validated on synthetic bifurcations and two observational sea-surface temperature benchmarks, namely Indo-Pacific SST and North Atlantic AMOC, ST-CND delivers localized and interpretable warnings. On the AMOC task, it achieves an AUROC of 0.783 and a critical-subnetwork IoU of 0.378, outperforming recurrence-network and lambda-AR1 baselines. The framework provides an interpretable and scalable pipeline for spatial early warning in Earth system science.
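
The idea of a "local recovery rate" can be illustrated with the classic lag-1 autoregressive estimate, which is the kind of indicator an AR(1) baseline relies on. The sketch is not ST-CND, which the abstract says uses dynamic mode decomposition; the function names, the zero-mean assumption, and the toy series are assumptions made for the example.

```python
import math


def ar1_coefficient(anomalies):
    """Least-squares AR(1) coefficient for a series of anomalies about equilibrium."""
    num = sum(a * b for a, b in zip(anomalies[:-1], anomalies[1:]))
    den = sum(a * a for a in anomalies[:-1])
    return num / den


def recovery_rate(anomalies, dt: float = 1.0) -> float:
    """Decay rate of perturbations, -ln(phi) / dt; requires a positive coefficient.

    Values approaching zero indicate critical slowing down.
    """
    return -math.log(ar1_coefficient(anomalies)) / dt


fast = [8.0, 4.0, 2.0, 1.0, 0.5]   # perturbation halves every step
slow = [8.0, 7.2, 6.48, 5.832]     # perturbation shrinks by 10% per step
print(recovery_rate(fast), recovery_rate(slow))
```

A subnetwork whose estimated rate drifts toward zero over time is recovering more slowly from perturbations, which is the usual early-warning reading of such an indicator.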

2606.17551 2026-06-17 cs.LG cs.AI New submission

Reversal Q-Learning

Aditya Oberai, Seohong Park, Sergey Levine

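Affiliations * University of California, Berkeley

AI summary Proposes the Reversal Q-learning (RQL) algorithm, which uses the expanded MDP framework and flow reversal to generate virtual on-policy trajectories, combined with a bias-and-variance reduction technique, to perform offline reinforcement learning with flow policies, achieving the best average performance across 50 robotic tasks.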

Abstract

Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.

2606.17546 2026-06-17 cs.AI New submission

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

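Affiliations * Department of Automation, Tsinghua University; Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University

AI summary Proposes the SEAGym evaluation environment, which measures agent harness updates along multiple dimensions (training, validation, test, replay, and cost records) to reveal whether an update yields reusable improvement, overfitting, higher cost, or degradation of older behavior.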

详情
AI中文摘要

基于LLM的自我进化智能体主要通过改变其智能体框架(agent harness)来改进:即围绕基础模型的结构化执行层,包括提示、记忆、工具、中间件、运行时状态以及模型-工具交互循环。现有评估通常将此过程简化为孤立的任务分数或单一的顺序曲线,掩盖了更新是否产生可复用的改进、过拟合近期任务、增加成本或损害旧行为。我们引入了SEAGym,一个用于跨训练、验证、测试、重放和成本记录衡量智能体框架更新的评估环境。SEAGym将Harbor兼容的基准测试转化为动态的自我进化任务源,包含训练批次、冻结更新验证、留出ID和OOD迁移视图、重放诊断以及保存的快照和指标记录。在Terminal-Bench 2.0和HLE上实例化SEAGym,我们在共享的epoch/batch协议下比较了ACE、TF-GRPO和AHE。结果表明,这些评估视图提供了关于进化过程的互补信号:频繁更新可能无法改善留出性能,有用的中间快照可能随后崩溃,源多样性和模型后端可能影响框架可靠性。

English abstract

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

2606.17545 2026-06-17 cs.LG q-fin.CP q-fin.PR New submission

Continuous-time Optimal Stopping through Deep Reinforcement Learning

通过深度强化学习的连续时间最优停止

Cosmin Borsa, Michael Ludkovski

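Affiliations: Department of Statistics & Applied Probability, UC Santa Barbara

AI summary: Proposes the CARLOS algorithm, which uses an aggregate deep neural network to learn stopping rules at arbitrarily fine time resolution and, through progressive time-grid refinement and adaptive sampling, approaches the American option price upper bound.

Comments: 33 pages

Details
AI Chinese abstract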

基于仿真的最优停止问题求解器必须离散化停止决策。在经典动态规划下,粗网格(只有少数停止机会)会显著低估最优期望回报,而在极细网格上,近似误差通过反向递归累积。为消除这一限制,我们开发了一种新的强化学习启发算法,能够在任意精细时间分辨率下学习停止规则。我们的CARLOS(连续时间自适应强化学习最优停止)算法利用聚合深度神经网络(ADNN)学习联合时空决策边界。从粗时间网格开始,我们逐步增加停止机会的频率,同时并行训练ADNN以精化其时机-价值估计。此外,我们设计了一种自适应采样策略,逐渐将训练集中到停止边界附近。基准测试结果表明,CARLOS相比现有百慕大求解器提供更高的价格,接近美式上界,并且相对于非RL比较器实现了高计算效率。

English abstract

Simulation-based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning-inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping) algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities, while in parallel training the ADNN to refine its timing-value estimates. We moreover design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmark results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.
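The claim that a coarse exercise grid undervalues the optimal reward can be illustrated on a toy problem. The sketch below is not CARLOS; it is a classical binomial-tree pricer for a put option, written under the assumption of a Cox-Ross-Rubinstein lattice, with a hypothetical `exercise_every` parameter that controls how often stopping is allowed. All function and parameter names are illustrative and do not come from the paper.

```python
import math

def bermudan_put(s0, strike, rate, sigma, maturity, steps, exercise_every):
    """Price a put on a binomial tree, allowing exercise only every
    `exercise_every` steps (and at maturity). Illustrative sketch only."""
    dt = maturity / steps
    u = math.exp(sigma * math.sqrt(dt))      # up factor
    d = 1.0 / u                              # down factor
    p = (math.exp(rate * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-rate * dt)              # one-step discount factor
    # Payoff at maturity for each terminal node (j = number of up moves).
    values = [max(strike - s0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]
    for n in range(steps - 1, -1, -1):
        # Continuation value: discounted expectation over the two successors.
        values = [disc * (p * values[j + 1] + (1 - p) * values[j]) for j in range(n + 1)]
        if n % exercise_every == 0:
            # Stopping is allowed here: take the better of continuing and exercising.
            values = [max(v, strike - s0 * u**j * d**(n - j)) for j, v in enumerate(values)]
    return values[0]

if __name__ == "__main__":
    for every in (64, 16, 1):
        price = bermudan_put(100.0, 100.0, 0.05, 0.2, 1.0, 64, every)
        print(f"exercise every {every:2d} steps: {price:.4f}")
```

Because the three exercise grids are nested, the computed price can only grow as the grid is refined. The sketch does not show the other half of the trade-off described in the abstract, namely that regression-based solvers accumulate approximation error on fine grids, since a lattice computes continuation values exactly.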

2606.17542 2026-06-17 cs.CL New submission

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

评估大型语言模型在会议中的收话人、话轮转换和下一说话人预测能力

Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara, Atsunori Ogawa, Yuya Chiba, Atsushi Ando, William Chen, Shinji Watanabe

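Affiliations: NTT, Inc., Japan; Language Technologies Institute, Carnegie Mellon University, USA

AI summary: Uses large language models (LLMs) to study turn-taking in multimodal multi-party conversations and builds an evaluation framework covering three tasks: addressee detection, turn-change prediction, and next speaker prediction. Experiments show that LLMs outperform supervised models and humans on next speaker prediction, while a multimodal LLM remains below human performance on addressee detection and turn-change prediction.

Comments: Accepted to INTERSPEECH 2026

Details
AI Chinese abstract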

我们利用大型语言模型(LLMs)研究多模态多人对话中的话轮转换。我们构建了一个评估框架,涵盖三个任务:收话人检测、话轮转换预测和下一说话人预测。我们比较了针对这些任务训练的监督模型、基于文本的LLMs、多模态LLMs(MM-LLMs)以及人类受试者。在AMI语料库上的实验表明,尽管LLMs未在目标领域进行训练,也无法访问音频或视觉信息,但它们在下一说话人预测方面优于监督模型和人类。多模态LLM在收话人检测和话轮转换预测上表现优于基于文本的LLMs,但仍低于人类表现,表明其难以利用原始音视频信号。消融分析显示,对话上下文至关重要,尤其是对于下一说话人预测。我们观察到人类和LLM的预测模式相似,且话轮转换频繁的区间对两者都具有挑战性。

English abstract

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

2606.17541 2026-06-17 cs.LG cs.AI New submission

Offline Preference-Based Trajectory Evaluation

基于偏好的离线轨迹评估

Fernando Diaz

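Affiliations: Carnegie Mellon University

AI summary: To address the statistical inefficiency of offline evaluation that relies only on terminal success, proposes preference-based trajectory evaluation, which compares trajectories through temporal preferences to reduce ties and improve discriminative power, ranking stability, and data efficiency.

Details
AI Chinese abstract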

智能系统的离线评估通常将轨迹简化为终端成功,丢弃了部分进展信息并导致大量平局,通过减少有效样本量和削弱区分系统的能力,造成显著的统计低效。我们提出基于偏好的轨迹评估,该方法通过时间偏好(关于进展和返回时间分布)直接比较轨迹。我们发现,在多种智能和交互基准测试中,基于标准成功率的指标在大约75%的实例上产生平局比较,而轨迹感知偏好将平局减少到大约35%,从而提高了区分能力、排名稳定性和数据效率。我们的结果表明,通常归因于数据收集不足或问题难度的基准饱和,也可能由评估指标的选择所解释。

English abstract

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
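A toy sketch of why collapsing trajectories to terminal success produces ties. The abstract does not specify the preference relation, so the lexicographic rule used here (more progress first, then fewer steps) is an assumption, as are the `Trajectory` fields and the four hand-made pairs; the resulting tie rates describe this toy data only and are unrelated to the figures reported in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    success: bool    # terminal success flag
    progress: float  # fraction of the task completed, in [0, 1] (hypothetical field)
    steps: int       # number of steps taken (hypothetical field)

def success_compare(a, b):
    """+1 if a wins, -1 if b wins, 0 on a tie, using terminal success only."""
    return (a.success > b.success) - (a.success < b.success)

def trajectory_compare(a, b):
    """Assumed lexicographic preference: more progress first, then fewer steps."""
    key_a = (a.progress, -a.steps)
    key_b = (b.progress, -b.steps)
    return (key_a > key_b) - (key_a < key_b)

def tie_rate(pairs, compare):
    """Fraction of paired comparisons that end in a tie."""
    return sum(1 for a, b in pairs if compare(a, b) == 0) / len(pairs)

# Each pair holds the trajectories of two systems on the same task instance.
pairs = [
    (Trajectory(True, 1.0, 12), Trajectory(True, 1.0, 20)),    # both succeed, first is faster
    (Trajectory(False, 0.6, 30), Trajectory(False, 0.2, 30)),  # both fail, first gets further
    (Trajectory(True, 1.0, 15), Trajectory(False, 0.8, 30)),   # only the first succeeds
    (Trajectory(False, 0.0, 30), Trajectory(False, 0.0, 30)),  # identical failures
]

print(tie_rate(pairs, success_compare))     # 0.75
print(tie_rate(pairs, trajectory_compare))  # 0.25
```

Every tied pair contributes nothing to a paired comparison between two systems, which is the sense in which ties shrink the effective sample size.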

2606.17540 2026-06-17 cs.CV New submission

TaFD: Threat-Aware Frequency Decoupling for Adversarial Robustness against Heterogeneous Attacks

TaFD:针对异构攻击的对抗鲁棒性的威胁感知频率解耦

Mengda Xie, Yiling He, Meie Fang

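Affiliations: School of Computer Science and Cyber Engineering, Guangzhou University; Information Security Research Group, University College London

AI summary: To address gradient incompatibility in joint adversarial training under heterogeneous attacks, proposes the Threat-aware Frequency Decoupling (TaFD) framework, which achieves structural parameter separation through a frequency-domain divide-and-conquer strategy and improves average robust accuracy by approximately 11% on multiple benchmarks.

Details
AI Chinese abstract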

多威胁鲁棒性仍然是深度学习中的一个基本挑战。尽管联合对抗训练(JAT)被广泛采用,但在异构威胁下,特别是$\ell_p$有界攻击和语义攻击之间,它遭受负迁移。通过一阶梯度分析,我们将此形式化为梯度不兼容,并从理论上证明了分离优化的必要性。我们进一步揭示这些冲突的威胁在频域中表现出可分离的频谱特性。受此观察启发,我们提出了威胁感知频率解耦(TaFD),一个两阶段防御框架,将JAT重新表述为频域分治范式。TaFD首先通过攻击频谱原型的无监督聚类发现潜在威胁域,并训练一个轻量级分类器用于推理时的威胁域识别。基于预测,TaFD采用频率条件卷积,学习威胁域特定的频谱掩码,并将每个样本路由到相应的专家,强制结构参数分离并缓解优化冲突。我们在三个代表性图像分类基准(CIFAR-10、CIFAR-100和Tiny-ImageNet)和两个代表性架构(卷积ResNet和混合Transformer MobileViT)上验证了TaFD。大量结果表明,与现有的JAT和频域基线相比,TaFD在异构攻击下实现了更均衡的鲁棒性,在保持领先的干净准确率的同时,平均鲁棒准确率比最强基线提高了约11%。

English abstract

Multi-threat robustness remains a fundamental challenge in deep learning. Although joint adversarial training (JAT) is widely adopted, it suffers from negative transfer under heterogeneous threats, particularly between $\ell_p$-bounded and semantic attacks. Through first-order gradient analysis, we formalize this as gradient incompatibility and theoretically establish the necessity of decoupled optimization. We further reveal that these conflicting threats exhibit separable spectral characteristics in the frequency domain. Motivated by this observation, we propose Threat-aware Frequency Decoupling (TaFD), a two-stage defense framework that reformulates JAT as a frequency-domain divide-and-conquer paradigm. TaFD first discovers latent threat domains via unsupervised clustering of attack spectral prototypes and trains a lightweight classifier for inference-time threat domain identification. Conditioned on the prediction, TaFD employs a Frequency-Conditional Convolution that learns threat-domain-specific spectral masks and routes each sample to the corresponding expert, enforcing structural parameter separation and alleviating optimization conflicts. We validate TaFD on three representative image-classification benchmarks (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and on two representative architectures (the convolutional ResNet and the hybrid-transformer MobileViT). Extensive results demonstrate that TaFD achieves more balanced robustness against heterogeneous attacks than existing JAT and frequency-domain baselines, improving average robust accuracy by approximately 11% over the strongest baseline while maintaining leading clean accuracy.

2606.17539 2026-06-17 cs.CV cs.AI New submission

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

空间视觉语言模型中的双路径推理强化

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

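Affiliations: The University of Hong Kong; NVIDIA; University of California, San Diego

AI summary: Proposes the SR-REAL framework, which uses reinforcement learning to combine two reasoning paths, language-only reasoning and detect-then-reason 3D grounding, and significantly improves spatial VLM performance on complex geometric reasoning tasks.

Details
AI Chinese abstract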

空间VLM在几何感知方面取得了显著进展,但需要多步推理(涉及深度、距离和场景关系)的复杂空间推理仍然具有挑战性。此外,不同的空间查询需要根本不同的策略:有些最好通过纯语言的逐步演绎来解决,而另一些则需要在进行定量推理之前进行显式的3D定位。我们提出了SR-REAL(通过强化学习实现空间VLM的双路径空间推理),这是一个统一框架,为空间VLM配备了两条互补的推理路径:纯语言推理(LOR),执行逐步语言演绎;以及先检测后推理(DTR),通过区域标记检测3D几何线索(如中心或边界框),然后进行显式几何推理。SR-REAL首先进行冷启动监督微调阶段,构建LOR和DTR的思维链监督,并暴露区域到3D的接口;随后进行强化学习,使用准确性和格式奖励优化策略模型;对于DTR,基于离散中心的检测奖励进一步细化几何对齐。在多种空间基准测试中,SR-REAL显著优于空间VLM基线:(i) 单个RL训练模型支持两条推理路径,DTR通过精确的3D定位在区域感知任务中表现出色,LOR增强了一般空间推理;(ii) 联合训练两条路径促进相互强化;(iii) 高质量、混合的冷启动数据对于稳定的RL优化至关重要;(iv) 模型无需逐任务调整即可跨数据集和领域泛化,展示了LOR和DTR之间的正向迁移。

English abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.

2606.17536 2026-06-17 cs.CV cs.AI New submission

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

OmniDrive: 一种由LLM编排的多智能体世界模型,用于多视角驾驶视频生成的统一潜在协同压缩

Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang

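Affiliations: Peking University; Xiamen University; Korea Advanced Institute of Science and Technology (KAIST); National Taiwan University; Wuhan University; Wuhan University of Technology; Tsinghua University; Jimei University

AI summary: Proposes DRIVE-CHOREO, an LLM-choreographed multi-agent world model in which three Qwen2.5-VL agents jointly produce a position-aware latent sequence that is co-compressed with the multi-view video by a 3D VAE using a view-time permutation, enabling controllable multi-view video generation; on nuScenes it reaches state-of-the-art multi-view consistency and a BEV mAP of 21.6.

Comments: 24 pages, 10 figures

Details
AI Chinese abstract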

自动驾驶的生成式世界模型面临两个未解决的对立:异构控制注入(自由形式语言、高清地图、轨迹和相机位姿存在于不兼容的表示空间)和事后跨视图融合(每个相机的潜在编码未能编码全局3D几何)。我们将两者追溯到同一个根本原因:在潜在标记级别上缺乏对齐语言、几何和像素的共享符号中间语言。我们提出DRIVE-CHOREO,一种由LLM编排的多智能体世界模型,将可控多视图视频生成重新定义为潜在编排。三个Qwen2.5-VL智能体——一个解析用户意图为结构化WorldScript的导演,一个将其接地为空间锚定布局标记的制图师,以及一个将跨视图批评反馈为辅助监督的审计员——共同创作一个单一的位置感知标记序列。该序列通过视图-时间置换与多视图视频协同压缩,在3D VAE的卷积感受野内强制实现相机间几何。在nuScenes上,DRIVE-CHOREO以具有竞争力的FVD(45.7)实现了新的最先进的多视图一致性和BEV mAP(21.6);仅在我们的合成数据上训练的检测器在真实验证集上获得了+2.4 NDS,验证了下游实用性。

English abstract

Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

2606.17534 2026-06-17 cs.RO New submission

RICH-SLAM: Radar SLAM with Incremental and Continuous Hilbert Mapping

RICH-SLAM:基于增量连续希尔伯特映射的雷达SLAM

Bingbing Zhang, Huan Yin, Yang Xu, Shuo Liu, Shaojie Shen, Fumin Zhang, Wen Xu

Affiliations: State Key Laboratory of Ocean Sensing, Zhejiang University; Interdisciplinary Student Training Platform for Marine areas, Zhejiang University; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; School of AI and Robotics, Hunan University; Ocean College, Zhejiang University; Institute of Deep-Sea Science and Engineering, Chinese Academy of Sciences

AI summary: Proposes the RICH-SLAM framework, which uses a Rao-Blackwellized particle filter back end and incremental Hilbert-space reduced-rank Gaussian process mapping to build continuous occupancy maps from sparse radar measurements and support uncertainty-aware planning.

Comments: 12 figures

Details
AI Chinese abstract

由于雷达对恶劣天气和光照条件具有固有的鲁棒性,使用雷达传感器进行同步定位与地图构建(SLAM)越来越受到关注。然而,与激光雷达和视觉数据相比,雷达测量具有稀疏和噪声大的特点,这给实现密集、连续且一致的地图表示带来了重大挑战。在本文中,我们提出了RICH-SLAM,一个旨在解决这些挑战的雷达SLAM框架。我们的方法采用基于Rao-Blackwellized粒子滤波的后端,使用粒子滤波进行位姿估计,卡尔曼滤波进行地图更新。我们提出了一种增量希尔伯特空间降秩高斯过程映射策略,能够在给定稀疏雷达输入的情况下实现连续且具有不确定性感知的地图表示。我们进一步引入了一种后验感知的粒子加权方案,利用地图参数的完整后验分布进行更鲁棒的似然评估。在自采集和公共ColoRadar数据集上的实验表明,RICH-SLAM能够从稀疏雷达测量中构建连续占用地图,并支持移动机器人的不确定性感知规划。

English abstract

Simultaneous localization and mapping using radar sensors has gained increasing attention due to radar's inherent robustness to adverse weather and lighting conditions. However, radar measurements are characteristically sparse and noisy compared to LiDAR and visual data, posing significant challenges in achieving dense, continuous, and consistent map representations. In this paper, we present RICH-SLAM, a radar SLAM framework designed to address these challenges. Our approach features a Rao-Blackwellized particle filter-based back end that employs particle filtering for pose estimation and Kalman filtering for map updates. We propose an incremental Hilbert-space reduced-rank Gaussian process mapping strategy that enables continuous and uncertainty-aware map representations given sparse radar inputs. We further introduce a posterior-aware particle weighting scheme that leverages the full posterior distribution of map parameters for more robust likelihood evaluation. Experiments on self-collected and public ColoRadar datasets show that RICH-SLAM constructs continuous occupancy maps from sparse radar measurements and supports uncertainty-aware planning for mobile robots.

2606.17526 2026-06-17 cs.LG New submission

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

MGUP:一种用于随机优化的动量-梯度对齐更新策略

Da Chang, Ganzhao Yuan

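Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology; Pengcheng Laboratory; University of Chinese Academy of Sciences

AI summary: Proposes the MGUP mechanism, which augments momentum-based optimizers by applying large step sizes to a fixed proportion of selected parameters and small step sizes to the rest; it comes with a theoretical convergence guarantee, and experiments show improved training efficiency and stability.

Comments: Published in NeurIPS 2025

Details
AI Chinese abstract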

高效优化对于训练大型语言模型至关重要。尽管层内选择性更新已被探索,但缺乏一种既能实现细粒度控制又能保证收敛的通用机制。为填补这一空白,我们提出了MGUP,一种新颖的选择性更新机制。MGUP通过每次迭代对选定的固定比例参数应用较大的步长,而对其余参数应用较小的非零步长,增强了标准的基于动量的优化器。作为一个近乎即插即用的模块,MGUP可无缝集成到AdamW、Lion和Muon等优化器中,产生强大的变体,如MGUP-AdamW、MGUP-Lion和MGUP-Muon。在标准假设下,我们为随机优化中的MGUP-AdamW(无权重衰减)提供了理论收敛保证。在包括MAE预训练、LLM预训练和下游微调在内的多种任务上的大量实验表明,与原始基础优化器相比,我们的MGUP增强型优化器实现了更优或更稳定的性能。我们提供了一种原则性、通用且具有理论基础的层内选择性更新策略,加速并稳定了大规模模型的训练。代码已在 https://github.com/MaeChd/MGUP 公开。

English abstract

Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose MGUP, a novel mechanism for selective updates. MGUP augments standard momentum-based optimizers by applying larger step sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step sizes to the rest. As a nearly plug-and-play module, MGUP seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as MGUP-AdamW, MGUP-Lion, and MGUP-Muon. Under standard assumptions, we provide theoretical convergence guarantees for MGUP-AdamW (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our MGUP-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at https://github.com/MaeChd/MGUP.
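A minimal sketch of a selective update in the spirit described above, applied to plain momentum SGD rather than AdamW, Lion, or Muon. The abstract does not state how parameters are selected; ranking coordinates by the product of momentum and gradient is an assumption suggested by the title, and `mgup_style_step`, `ratio`, and `small_scale` are hypothetical names, not the released implementation.

```python
def mgup_style_step(params, grads, momentum, lr=0.1, beta=0.9, ratio=0.5, small_scale=0.1):
    """One selective momentum step on flat lists of floats (illustrative only)."""
    # Update the momentum buffer (exponential moving average of gradients).
    new_m = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    # Assumed alignment score: positive when momentum and gradient agree in sign.
    scores = [m * g for m, g in zip(new_m, grads)]
    # Select a fixed proportion of coordinates with the highest alignment.
    k = max(1, int(ratio * len(params)))
    order = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)
    selected = set(order[:k])
    # Selected coordinates take the full step; the rest take a smaller, non-zero step.
    new_p = [
        p - (lr if i in selected else lr * small_scale) * m
        for i, (p, m) in enumerate(zip(params, new_m))
    ]
    return new_p, new_m, selected

if __name__ == "__main__":
    p, m, chosen = mgup_style_step(
        params=[1.0, 1.0, 1.0, 1.0],
        grads=[1.0, -1.0, 0.5, 2.0],
        momentum=[1.0, 1.0, 0.0, 0.0],
    )
    print(sorted(chosen), p)
```

Keeping the smaller step non-zero means every parameter still moves on every iteration, which is the property the abstract distinguishes from updating only a subset.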

2606.17524 2026-06-17 cs.LG New submission

Learning to Refine Hidden States for Reliable LLM Reasoning

学习精炼隐藏状态以实现可靠的LLM推理

Chia-Hsuan Hsu, Jui-Ming Yao

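Affiliations: Tongyu0924

AI summary: Proposes the ReLAR framework, which uses reinforcement-guided latent state refinement to adaptively choose the number and direction of refinement steps, improving accuracy and stability on complex multi-step reasoning while reducing inference overhead.

Comments: Code is available at tongyu0924/Learning-to-Refine-Hidden-States

Details
AI Chinese abstract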

大型语言模型展现出强大的推理能力,但在复杂的多步设置中,其内部推理过程可能不稳定,早期隐藏状态错误可能传播到错误预测。我们提出ReLAR,一种强化引导的潜在精炼框架,在解码前迭代更新隐藏表示。ReLAR维护一个紧凑的潜在推理状态,并使用学习到的深度和动作控制器自适应地确定精炼步骤的数量和方向。控制器基于逐步似然改进的策略梯度目标进行训练,实现了高效的输入依赖推理,无需显式的思维链生成。在医学、数学、多跳推理和开放式生成基准上的实验表明,ReLAR提高了准确性、生成质量和推理稳定性,同时推理开销显著低于显式推理基线。

English abstract

Large language models show strong reasoning ability, but their internal reasoning process can remain unstable in complex multi-step settings, where early hidden-state errors may propagate to incorrect predictions. We propose ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations before decoding. ReLAR maintains a compact latent reasoning state and uses learned depth and action controllers to adaptively determine both the number and direction of refinement steps. The controllers are trained with a policy gradient objective based on step-wise likelihood improvement, enabling efficient input-dependent reasoning without explicit chain-of-thought generation. Experiments on medical, mathematical, multi-hop reasoning, and open-ended generation benchmarks show that ReLAR improves accuracy, generation quality, and reasoning stability with substantially lower inference overhead than explicit reasoning baselines.

2606.17522 2026-06-17 cs.CL cs.LG New submission

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

深度Transformer中基于有界深度文法的层次建模的表达性分析

Vinoth Nandakumar, Qiang Qu, Pramod Thebe, Sakshi Khachariya, Tongliang Liu

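Affiliations: University of Sydney; San Francisco State University; IIT Madras

AI summary: Using bounded-depth context-free grammars, shows that deep transformers can be constructed whose depth grows linearly with grammar depth and whose neuron count scales with the number of derivation-tree shapes and the number of production rules, supporting the linear representation hypothesis.

Details
AI Chinese abstract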

深度神经网络普遍被认为其表达能力源于形成层次表示的能力,即在各层中逐步捕获更抽象和组合的特征。在语言建模中,Transformer已成为主导架构,早期层捕获局部句法模式,后期层编码更复杂的从句级依赖。尽管这种直觉塑造了模型设计,但缺乏严格的理论工作来展示深度Transformer如何表示这种层次结构。本文通过有界深度、非递归上下文无关文法的形式化视角,分析深度Transformer模型的表达性。对于这类文法,我们显式构造了具有位置注意力的Transformer,其深度随文法深度线性增长,而神经元数量随派生树形状数量以及产生式规则数量的平方缩放。我们的理论结果支持线性表示假说,证明了这些架构具有将抽象语法状态编码为残差流中低维、线性可分子空间的结构能力。

English abstract

Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.
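To make "grammar depth" concrete, the sketch below computes the height of the tallest derivation tree of a small non-recursive context-free grammar. The definition of depth used here and the toy grammar are assumptions for illustration; the paper may define depth differently, and the function does not terminate on recursive grammars.

```python
def grammar_depth(rules, symbol):
    """Height of the tallest derivation tree rooted at `symbol`.

    `rules` maps each nonterminal to a list of right-hand sides; any symbol
    without an entry is treated as a terminal of depth 0. Assumes the grammar
    is non-recursive, otherwise the recursion never ends.
    """
    if symbol not in rules:
        return 0
    return 1 + max(grammar_depth(rules, s) for rhs in rules[symbol] for s in rhs)

# Toy grammar (hypothetical): S -> NP VP, NP -> Det N, VP -> V NP, plus lexical rules.
rules = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"]],
}

print(grammar_depth(rules, "S"))  # 4
```

Under the abstract's result, a grammar like this one with a fixed, finite depth is the setting in which the constructed transformer needs a number of layers proportional to that depth.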

2606.17520 2026-06-17 cs.RO cs.CV New submission

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

GASE:基于高斯溅射的自动化系统用于重建具身仿真环境

Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun, Qichen Zhang, Yuhao Xu, Yantai Yang, Yingqiao Wang, Qin Jin, Zhipeng Zhang

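Affiliations: AutoLab, SAI, Shanghai Jiao Tong University; AIM3 Lab, School of Information, Renmin University of China; Research Lab, Anyverse Dynamics

AI summary: Proposes the GASE system, which uses multi-view video streams from panoramic camera arrays, extracts foreground objects with a camera-pose-based strategy, inpaints the scene, and reconstructs objects and background independently before importing them into physics simulators; it enables efficient, high-fidelity simulation environment construction, improves segmentation accuracy by over 10%, and keeps the real-robot deployment performance gap below 10%.

Details
AI Chinese abstract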

在现实世界中训练具身代理需要熟练的操作人员和昂贵的硬件。仿真环境通过实现大规模、低成本的数据增强提供了一种引人注目的替代方案。因此,快速构建具有最小仿真到现实差距的高保真仿真场景已成为机器人学习的关键目标。尽管基于重建的方法提供了优越的视觉质量,但当前的工作流程受到低效的数据采集和次优的前景物体提取的阻碍。因此,我们提出了GASE,一个高度自动化的仿真场景构建系统。GASE利用全景相机阵列的多视角视频流实现快速环境扫描。为确保高质量的资产生成,我们的流程引入了一种基于相机位姿的策略,在2D域中跨帧鲁棒地提取物体,随后进行高保真场景修复。前景物体和静态背景随后被独立重建,并无缝导入物理仿真器用于策略训练。大量实验表明,GASE在分割精度上比现有的基于3D高斯的方法提高了超过10%,同时实现了最先进的修复质量。此外,在操作和导航任务中的真实机器人部署保持了与纯真实世界数据训练策略相比低于10%的性能差距。这些结果证实GASE为弥合仿真到现实差距提供了高效且高度有效的解决方案。代码将发布。

English abstract

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

2606.17519 2026-06-17 cs.CL cs.AI New submission

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

扩展企业智能体路由:退化、诊断与恢复

Kellen Gillespie, Robyn Perry

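Affiliations: Superhuman, Inc.

AI summary: Studies how routing accuracy drops as an enterprise assistant's tool catalog grows, and shows that embedding-based shortlisting recovers 10 to 17 percentage points of F1.

Comments: 10 pages (6 main + 4 appendix), 4 figures, 6 tables

Details
AI Chinese abstract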

生产级LLM助手将用户请求路由到日益增长的专用工具库,但随着目录规模扩大,路由准确率如何退化?我们在一个已部署的企业生产力助手的110个智能体、584个工具的目录上研究单步路由,评估了从10到110个智能体的三种前沿模型。在未充分指定的请求上,路由F1分数跨模型下降16-23个百分点。一个oracle分析将退化分解为检索差距(模型无法找到正确工具)和混淆差距(即使完美检索,oracle上限也下降10pp)。基于嵌入的预选在全部规模下为所有三种模型和两个提供商恢复+10-11pp F1分数。一项生产标注研究(1,435个人工标注话语,三个标注者)确认了在真实流量上的恢复,尽管绝对性能低10-15pp,但恢复幅度为+10-17pp。

English abstract

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16 to 23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10 pp). Embedding-based shortlisting recovers 10 to 11 pp of F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at 10 to 17 pp despite 10 to 15 pp lower absolute performance.
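Embedding-based shortlisting can be sketched as a nearest-neighbor filter that runs before the model sees the catalog. The abstract gives no implementation details, so the cosine-similarity ranking, the tool names, and the three-dimensional toy vectors below are all assumptions; a real system would obtain its vectors from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def shortlist(query_vec, tool_vecs, k=2):
    """Return the names of the k tools whose embeddings are closest to the query."""
    ranked = sorted(tool_vecs, key=lambda name: cosine(query_vec, tool_vecs[name]), reverse=True)
    return ranked[:k]

# Hypothetical catalog with hand-made embeddings.
tool_vecs = {
    "calendar.create_event": [0.9, 0.1, 0.0],
    "email.send": [0.1, 0.9, 0.0],
    "docs.summarize": [0.0, 0.2, 0.9],
    "calendar.find_slot": [0.8, 0.0, 0.2],
}
query = [1.0, 0.1, 0.1]  # stands in for the embedding of a scheduling request

print(shortlist(query, tool_vecs, k=2))
# ['calendar.create_event', 'calendar.find_slot']
```

Only the shortlisted tools would then be placed in the routing prompt, so the model chooses among a few candidates rather than the full catalog.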

2606.17516 2026-06-17 cs.LG cs.AI stat.ME stat.ML New submission

FoundCause: Causal Discovery with Latent Confounders from Observational Data

FoundCause: 从观测数据中发现含隐混淆因子的因果关系

Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

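Affiliations: Amazon Web Services; Department of Statistics, University of California, Davis

AI summary: Proposes FoundCause, an amortized causal discovery model trained on synthetic data that maps a dataset directly to a causal graph in a single forward pass and explicitly models latent confounders; it outperforms 11 non-amortized and 4 amortized methods on 15 real-world datasets.

Comments: Download the model at https://github.com/amazon-science/foundcause

Details
AI Chinese abstract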

从观测数据中发现因果关系仍然具有挑战性,因为需要在没有干预的情况下恢复有向结构和隐混淆因子。我们提出了FoundCause,一种完全在合成数据上训练的摊销因果发现模型,它通过单次前向传递直接将数据集映射到因果图。通过从大量模拟结构因果模型中学习,FoundCause捕获了可迁移的统计模式,这些模式泛化到单个数据集之外。该架构融合了因果发现的几个关键归纳偏置。它使用一个置换不变的Transformer编码器,通过交替关注样本和变量来联合建模跨变量依赖性和每个变量的分布。通过统计条件注意力注入来自经典非对称度量的成对统计特征,引导模型朝向已知的因果信号。一个分解的解码器将边的存在性与方向分离,而一个三角细化模块使得能够推理高阶因果模式,如链和碰撞器。此外,一个基于可学习隐令牌的专用混淆因子模块显式建模隐藏的共同原因,并且模型通过其掩码输入表示显式处理缺失数据。据我们所知,FoundCause是第一个显式建模隐混淆因子的摊销因果发现方法。FoundCause在15个真实数据集上优于11种经典非摊销方法(如PC、GES、NOTEARS风格优化)和4种摊销因果发现方法,相对于最强的非摊销方法,在$F_1$上提高了9.6%,在AUROC上提高了1.2%,结构汉明距离减少了18.9%,同时仅需单次前向传递即可完成推理。

English abstract

Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.

2606.17513 2026-06-17 cs.LG cs.AI New submission

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

几何感知的算子学习事后不确定性量化

Oriol Vendrell-Gallart, Nima Negarandeh, Ramin Bostanabad

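Affiliations: Department of Mechanical and Aerospace Engineering, University of California, Irvine

AI summary: Proposes the REEF-GP framework, which fits a Gaussian process to the residuals of a frozen neural operator and uses the operator's intrinsic coordinate-feature representations to build geometry-aware uncertainty; it achieves calibrated uncertainty estimates on multiple PDE benchmarks at far lower computational cost than deep ensembles.

Details
AI Chinese abstract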

神经算子为偏微分方程提供快速代理模型,但其确定性预测限制了在需要不确定性量化(UQ)的任务中的使用,尤其是在几何变化下。现有方法主要对网络参数进行不确定性建模,很大程度上忽略了算子本身学习的几何感知表示。我们提出REEF-GP(残差嵌入特征高斯过程),一种事后UQ框架,将高斯过程拟合到冻结神经算子的残差上,该算子的内部嵌入定义了核特征空间。REEF-GP不学习单独的特征映射,而是调整算子固有的坐标-特征表示以构建几何感知的不确定性。为了确保非结构化域上的稳定性和可扩展性,REEF-GP结合了谱归一化投影、异方差几何感知噪声以及高效基于子集的训练,避免了限制性的低秩近似。在五个具有不同几何形状的PDE基准测试中,REEF-GP保持了预测准确性,同时实现了与深度集成相竞争但成本仅为其一小部分的校准不确定性估计。我们的方法在几何分布偏移下保持鲁棒性,不确定性集中在物理上有意义的区域(例如激波前沿)。我们的结果表明,神经算子的准确且可扩展的事后UQ可以直接在其学习的特征空间中实现,为参数中心方法提供了实用替代方案。

English abstract

Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.

2606.17511 2026-06-17 cs.RO cs.AI cs.CV New submission

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

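Affiliations * Northwestern University(西北大学) Peking University(北京大学) University of California, Berkeley(加州大学伯克利分校) ShanghaiTech University(上海科技大学)

AI summary Proposes MagicSim, an embodied-interaction infrastructure built on a deterministic batched runtime and a shared MDP. YAML specifications decouple contents, placement, behavior, and agent exposure, unifying world construction, execution, evaluation, and automatic trajectory generation.

Details
AI Chinese abstract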

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底,而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层,无法重现、评估和标注同一情节。我们提出MagicSim,一个围绕确定性批处理运行时和共享马尔可夫决策过程(MDP)构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露,MagicSim在单一重置-步进循环中构建多样化的可执行世界,涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化,将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力:基准测试和强化学习评估、自动收集接口(自动将命令转化为具体轨迹)以及面向智能体/VLM的交互。对于自动执行,命令流经Command->Skill->Planner->Robot->Record流水线,而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹,将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此,MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

English abstract

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

2606.17508 2026-06-17 cs.LG cs.DC cs.PL cs.SE New submission

When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

当下一步不是一步:面向并发Go程序的分布感知执行建模

Kaviru Hapuarachchi

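Affiliations * University of Colombo School of Computing(科伦坡大学计算学院)

AI summary Addresses the difficulty of single-label prediction under nondeterministic scheduling in concurrent programs. Proposes a distribution-aware training method that aggregates empirical distributions over multiple runs and fine-tunes a 7B model, reaching 36.2% accuracy on predictions drawn from real Go bugs and lowering Expected Calibration Error.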

Comments 10 pages, 2 figures

Details
AI Chinese abstract

训练模型预测并发程序中的下一步比看起来更难:从相同跟踪前缀出发的同一程序的两次运行可能产生不同的有效下一事件,因为调度器是非确定性的。针对单一标签训练的模型实际上是在学习猜测随机过程的一个结果。我们反过来利用这种非确定性作为训练信号。我们将每个程序运行多次,将观察到的下一事件聚合成经验分布,并使用KL散度目标微调一个7B模型。在从真实生产Go缺陷(CockroachDB、Kubernetes、gRPC、etcd)中抽取的798个保留预测上,对少于一千个跟踪进行微调达到了36.2%的准确率,超过了零样本的Gemini 3.5 Flash(34.8%)和未微调的同一模型(28.6%)。分布训练在准确率上与交叉熵相当(35.8% vs. 36.2%),同时将期望校准误差从0.205降低到0.169。我们还推导出一类select阻塞goroutine的形式化goroutine泄漏特征,其中P(GoUnblock)=0由调度器语义保证,而非学习得到。我们发布了数据集、训练适配器和所有工具。

English abstract

Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic. A model trained against a single label is learning to guess one outcome of a random process. We turn this around and use the nondeterminism as a training signal. We run each program many times, aggregate the observed next events into an empirical distribution, and fine-tune a 7B model to match that distribution with a KL objective. On 798 held-out predictions drawn from real production Go bugs (CockroachDB, Kubernetes, gRPC, etcd), fine-tuning on fewer than a thousand traces reaches 36.2% accuracy, ahead of Gemini 3.5 Flash used zero-shot (34.8%) and the same model without fine-tuning (28.6%). Distribution training matches cross-entropy on accuracy (35.8% vs. 36.2%) while reducing Expected Calibration Error from 0.205 to 0.169. We also derive a formal goroutine-leak signature for a class of select-blocked goroutines where P(GoUnblock)=0 holds by scheduler semantics, not by learning. We release the dataset, trained adapters, and all tooling.
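
The distribution-aware target described above can be sketched in a few lines: observed next events from repeated runs of the same trace prefix are aggregated into an empirical distribution, and a model's predicted distribution is scored against it with KL divergence. This is a minimal illustration under stated assumptions, not the paper's training code. The three-event vocabulary, the sample runs, and the predicted probabilities are all made up; only the event name `GoUnblock` appears in the abstract.

```python
import math
from collections import Counter

def empirical_distribution(observed_events, vocabulary):
    """Turn next events observed across repeated runs into probabilities."""
    counts = Counter(observed_events)
    total = len(observed_events)
    return [counts[event] / total for event in vocabulary]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target probability contribute 0."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(target, predicted) if p > 0)

# Hypothetical vocabulary and runs, for illustration only.
vocabulary = ["GoUnblock", "GoBlock", "GoSched"]
runs = ["GoUnblock", "GoBlock", "GoUnblock", "GoUnblock"]  # same prefix, four runs

target = empirical_distribution(runs, vocabulary)  # [0.75, 0.25, 0.0]
predicted = [0.5, 0.25, 0.25]                      # assumed model output
loss = kl_divergence(target, predicted)
```

A single-label objective would treat one of the four runs as the only correct answer. The distribution target instead rewards a model for placing 75% of its probability on `GoUnblock` and 25% on `GoBlock`, which matches what the scheduler actually produced in this example.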