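arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.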

2606.02498 2026-06-02 cs.CV

GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction

Boyu Yuan, Jiamiao Lu, Weichuan Zhang, Benqing Wu, Tuo Wang, Changshan Wang, Changming Sun, Liang Guo

Affiliations: Image Computing Laboratory, Shaanxi University of Science and Technology; Department of Neonatology, Shenzhen University of Advanced Technology General Hospital; Department of Neurosurgery, The First Affiliated Hospital of Xi’an Jiaotong University; CSIRO Technology

AI summary: Proposes GloResNet, a lightweight 3D CNN based on ResNet-10 that combines global manifold mapping with a preprocessing strategy to predict preterm brain injury on the dHCP dataset, reaching 75.18% average accuracy.

Abstract

This study introduces an automated deep learning framework for predicting brain injury (BI) in preterm infants from T2-weighted MRI (dHCP dataset). We propose GloResNet, a lightweight 3D CNN based on ResNet-10, pretrained on MedicalNet to address data scarcity. A global manifold mapping strategy first resamples each 3D volume to 128x128x128 and then applies subject-wise z-score intensity normalization, thereby preserving global topology while standardizing appearance. Training integrates mixup, class weighting, and test-time augmentation for robustness. In 5-fold cross-validation, GloResNet achieved 75.18% average accuracy (peak 81.82%), with specificity 0.81 and sensitivity 0.76. Results demonstrate that a topology-aware lightweight CNN has the capability to effectively predict neonatal BI, offering a non-invasive screening tool. The source code of this paper can be obtained from the GitHub repository: https://github.com/ICL-SUST/GloResNet-Preterm-Brain
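The normalization and augmentation steps named in the abstract can be illustrated with a minimal sketch. It assumes a volume already flattened to a list of intensities and leaves out the 128x128x128 resampling; `zscore_normalize` and `mixup` are hypothetical helper names, not functions from the authors' repository.

```python
import math

def zscore_normalize(voxels):
    """Subject-wise z-score: use the volume's own mean and standard deviation."""
    n = len(voxels)
    mean = sum(voxels) / n
    var = sum((v - mean) ** 2 for v in voxels) / n
    std = math.sqrt(var) or 1.0  # guard against a constant volume
    return [(v - mean) / std for v in voxels]

def mixup(x_a, y_a, x_b, y_b, lam):
    """Blend two samples and their one-hot labels with weight lam in [0, 1]."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

# Toy intensities (made-up values, not dHCP data).
volume = [10.0, 20.0, 30.0, 40.0]
normalized = zscore_normalize(volume)

# Mix an "injury" sample with a "no injury" sample (labels are illustrative).
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

After normalization each volume has zero mean and unit variance, so intensity scale differences between scans no longer dominate the input.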

2606.02491 2026-06-02 cs.CV

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

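Affiliations: KAIST AI

AI summary: Proposes MORPHOS, a framework that uses Temporal Structured Latents (T-SLAT) as a unified representation of dynamic 4D assets and generates them autoregressively with causal attention, addressing compatibility across representations, topological changes, and long-term consistency.

Comments: Project page: https://cvlab-kaist.github.io/MORPHOS/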

Abstract

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
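As a rough illustration of causal attention over frames, the mask below lets frame i attend only to frames 0 through i. This is the generic construction, not the MORPHOS implementation, and `causal_mask` is a hypothetical name.

```python
def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

# Print a small mask: 1 = visible, 0 = hidden future frame.
for row in causal_mask(4):
    print("".join("1" if allowed else "0" for allowed in row))
```

Conditioning each frame only on its history is what makes the generation autoregressive: no frame can depend on one that has not been produced yet.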

2606.02490 2026-06-02 cs.LG

Expressivity of congruence-based architectures for DNNs on positive-definite matrices

Antonin Oswald, Estelle Massart

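Affiliations: Antonin Oswald; Estelle Massart

AI summary: Studies the expressivity of congruence layers (which left- and right-multiply the input matrix by a weight matrix and its transpose) for classifying positive-definite matrices; finds that a semi-orthogonality constraint limits expressivity, collapsing the network to an equivalent single-hidden-layer structure, and analyzes how compatible different Riemannian classifiers are with the feature maps of congruence layers.

Comments: Accepted for Eusipco 2026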
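A congruence map of the kind the summary describes can be sketched in a few lines. The form X -> W X W^T is an assumption based on the summary's wording, and the matrices are toy values; the paper's exact layer definition and constraints are not given in this listing.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def congruence(w, x):
    """Congruence map X -> W X W^T."""
    return matmul(matmul(w, x), transpose(w))

x = [[2.0, 1.0], [1.0, 2.0]]  # symmetric positive definite input
w = [[1.0, 0.0], [1.0, 1.0]]  # invertible weight matrix (toy values)
y = congruence(w, x)
det_y = y[0][0] * y[1][1] - y[0][1] * y[1][0]
```

When W has full row rank, the output stays symmetric positive definite, which is why such layers can be stacked on positive-definite inputs.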

2606.02488 2026-06-02 cs.AI

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Yuyang Li, Zihe Yan, Tobias Käfer

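Affiliations: Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany; Shanghai Jiao Tong University, Shanghai, China

AI summary: Proposes the RASER routers, which use six features from one-shot RAG to decide whether to escalate to a more expensive retrieval strategy; without any extra LLM calls, they match SOTA F1 while saving a large share of tokens.

Comments: Under Review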

Abstract

Multi-hop question-answering systems often use expensive retrieval on every question. They may decompose the question, run several retrieval rounds, or search through bridge entities before answering. All of these strategies rely on repeated LLM calls to rewrite or decompose the question, which adds token cost and is a poor fit when the LLM budget is tight. However, our analysis shows that many multi-hop questions are already answered correctly by one-shot RAG, so running an extra retrieval on every question wastes the budget. We introduce RASER (Recoverability-Aware Selective Escalation Router), a family of cheap routers built on one-shot RAG and six features from it. RASER-2 decides whether to stop or escalate to the extra-retrieval action PRUNE. RASER-3 chooses among one-shot RAG, PRUNE, and iterative retrieval IRCoT, using the same features but adding an explicit cost-accuracy trade-off. Neither router makes an extra LLM call to decide. Across six LLMs and three multi-hop QA benchmarks, both routers stay competitive with the other state-of-the-art (SOTA) baselines in F1 while spending only 41-49% of always-PRUNE's tokens and also less than the iterative and decomposition retrieval baselines.
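The cost-accuracy trade-off can be pictured with a small sketch. Everything here is an assumption for illustration: the predicted F1 values, the relative token costs, and the linear penalty are made up, and the abstract does not describe how RASER turns its six features into a decision.

```python
def route(pred_f1, token_cost, trade_off):
    """Pick the action with the best predicted F1 minus a token-cost penalty."""
    return max(pred_f1, key=lambda action: pred_f1[action] - trade_off * token_cost[action])

# Hypothetical per-question estimates for the three actions.
pred_f1 = {"one_shot": 0.55, "prune": 0.62, "ircot": 0.66}
token_cost = {"one_shot": 1.0, "prune": 2.5, "ircot": 6.0}

cheap_budget = route(pred_f1, token_cost, trade_off=0.10)
balanced = route(pred_f1, token_cost, trade_off=0.02)
no_cost_penalty = route(pred_f1, token_cost, trade_off=0.0)
```

Raising `trade_off` makes the router stop at one-shot RAG more often; setting it to zero always picks the action with the highest predicted accuracy.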

2606.02487 2026-06-02 cs.CL

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde, Mary Khetani, Yu-Shan Tseng, Vanessa Barbosa, Julie Vignato, Lindsey Knake, Rajashree Dahal, Emily Spellman, Danielle Hitzel, Janine Petitgout, Kristi Haughey, Amanda Karstens, Brianna Clarahan, Rachel Dawson, Lauren Boyd, Mackenzie Weis, Angie Tipton, Jaewon Bae, Catherine K. Craven, Karen Dunn Lopez, Andrew D. Boyd

Affiliations: University of Illinois Chicago; University of Iowa; University of Missouri-Columbia

AI summary: Proposes a clinical provenance categorization pipeline based on supervised fine-tuning of large language models for multidisciplinary hospital-stay summarization; with a quantized 70B model, F1 improves by 7% in cross-domain transfer, and the results show that model capacity is critical for semantic flexibility.

Comments: 5 pages. Submitted preprint version of a paper accepted to AIME 2026. This version may differ from the camera-ready manuscript and the final Version of Record. The Version of Record will be available from Springer Nature once published

Abstract

Effective "all-team" summarization in high-complexity settings like the Neonatal Intensive Care Unit (NICU) requires aggregating insights from diverse disciplines (physicians, nurses, therapists) spread across hundreds of clinical free-text notes. Simply pooling heterogeneous text often leads to incoherent outputs. Structured summarization therefore first requires accurate categorization of sentence-level provenance across multi-source notes. This pilot study introduces a clinical provenance categorization pipeline using supervised fine-tuning (SFT) of large language models (LLMs). We adapted two Llama-3 models (8B and 70B) to MedSecId, a corpus of 2,002 MIMIC-III (Adult ICU) notes annotated with clinical provenance headers, achieving in-domain Macro F1 scores above 92% for both models. To evaluate cross-domain generalization, we assessed model capacity (8B vs. 70B) and quantization on a gold-standard dataset of 227 sentence-level spans derived from three multi-disciplinary NICU summaries. Experimental results demonstrate a scale-dependent transfer effect: while SFT produced only marginal changes for the 8B model, it substantially improved the 70B model, increasing Macro F1 by 7%. Notably, the quantized fine-tuned 70B model outperformed its full-precision baseline while substantially reducing computational requirements. These findings suggest that sufficient model capacity is critical for preserving semantic flexibility during cross-domain clinical transfer and that efficient quantized adaptation can enable structured provenance modeling for downstream summarization.
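Macro F1, the metric reported above, averages per-class F1 so that rare categories count as much as common ones. The sketch below uses placeholder labels "A", "B", "C" rather than the paper's provenance categories, which are not listed in this abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
score = macro_f1(y_true, y_pred)  # per-class F1: A = 2/3, B = 2/3, C = 1
```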

2606.02486 2026-06-02 cs.RO

Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

Shahram Najam Syed, Arthur Jakobsson, Haoran Hao, Jeffrey Ichnowski

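Affiliations: Robotics Institute, Carnegie Mellon University

AI summary: Proposes AHEAD, a framework that predicts future visual features with a latent-space world model so that a frozen VLA model achieves high manipulation success rates in dynamic scenes.

Comments: 28 pages, 7 figures, 16 tables, Su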

Abstract

Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.
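The adaptive-horizon idea, rolling a predictor forward and halting once uncertainty crosses a threshold, can be sketched generically. The `step` and `uncertainty` callables and the scalar state are stand-ins chosen for illustration; AHEAD itself operates on patch tokens in the VLA's feature space, which this toy does not model.

```python
def rollout(state, step, uncertainty, threshold, max_horizon):
    """Advance the prediction until the next step would be too uncertain."""
    horizon = 0
    while horizon < max_horizon:
        candidate = step(state)
        if uncertainty(candidate, horizon + 1) > threshold:
            break  # keep the last prediction that was still trusted
        state = candidate
        horizon += 1
    return state, horizon

# Toy setup: an object moving 0.5 units per step, uncertainty growing linearly.
final_state, horizon = rollout(
    state=0.0,
    step=lambda s: s + 0.5,
    uncertainty=lambda s, k: 0.1 * k,
    threshold=0.35,
    max_horizon=10,
)
```

The returned state plays the role of the predicted future observation that would be handed to the frozen action decoder in place of the current one.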

2606.02484 2026-06-02 cs.AI cs.LG

Iteris: Agentic Research Loops for Computational Mathematics

Leheng Chen, Zihao Liu, Wanyi He, Bin Dong

Affiliations: School of Mathematical Sciences, Peking University; Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University; Center for Machine Learning Research, Peking University; Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University; Zhongguancun Academy

AI summary: Proposes Iteris, an agentic research system that tackles two open problems in computational mathematics through numerical experiments, constructions, and proof drafts, yielding verified results after expert review.

Comments: 43 pages

Abstract

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.

2606.02479 2026-06-02 cs.CV

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim

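Affiliations: Korea University; KAIST

AI summary: Proposes COVRAG, a depth-based coverage-maximizing retrieval-augmented generation framework that uses pretrained 3D priors to build lightweight coverage maps as memory evidence and iteratively retrieves frames that maximize residual coverage, improving geometric consistency in long video generation.

Comments: 19 pages, 10 figures, 5 tables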

Abstract

Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
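Maximizing residual coverage gain resembles greedy set cover, which the sketch below illustrates. It assumes each candidate frame's coverage is already available as a set of target-view cell indices; how COVRAG derives coverage maps from depth is not shown, and `select_memory_frames` is a hypothetical name.

```python
def select_memory_frames(frame_coverage, already_covered, budget):
    """Greedily pick frames that cover the most still-uncovered target cells."""
    covered = set(already_covered)
    selected = []
    for _ in range(budget):
        best_id, best_gain = None, 0
        for frame_id, cells in frame_coverage.items():
            if frame_id in selected:
                continue
            gain = len(cells - covered)  # residual coverage gain
            if gain > best_gain:
                best_id, best_gain = frame_id, gain
        if best_id is None:  # no remaining frame adds coverage
            break
        selected.append(best_id)
        covered |= frame_coverage[best_id]
    return selected, covered

# Toy coverage sets over target-view cells 0..6 (made-up values).
frame_coverage = {"f1": {0, 1, 2, 3}, "f2": {3, 4}, "f3": {5, 6}, "f4": {0, 1}}
selected, covered = select_memory_frames(frame_coverage, already_covered={0}, budget=3)
```

Frame `f4` is never chosen because everything it sees is already explained by the context or by earlier picks, which is the behavior the residual criterion is meant to produce.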

2606.02470 2026-06-02 cs.AI

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen

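Affiliations: Tsinghua University

AI summary: Existing benchmarks overlook the challenges of personal social applications, where tools interact with individual accounts or local databases. MCP-Persona addresses this by simulating real-world personalized MCP tools to evaluate LLM agents; experiments show that current agents struggle markedly with personalized tool use.

Comments: ICML 2026 Camera Ready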

Abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.

2606.02465 2026-06-02 cs.CL cs.AI

Learning When to Translate for Multilingual Reasoning

Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

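Affiliations: Graduate School of Artificial Intelligence; Department of Computer Science & Engineering; POSTECH

AI summary: Proposes Luar, a framework that uses reinforcement learning to train reasoning language models to selectively invoke translation when direct understanding is unreliable, narrowing the multilingual reasoning gap.

Comments: preprint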

Abstract

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at https://github.com/deokhk/LUAR

2606.02463 2026-06-02 cs.CV cs.AI

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

Hilton Raj, Vishnuram AV

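Affiliations: Boston University

AI summary: Proposes MASER, a framework that trains five modality adapters on a shared VLM backbone and learns a neural routing policy that picks the best adapter for each question, addressing the fact that embodied agents' multimodal reasoning in 3D environments otherwise ignores question semantics.

Comments: Accepted to CVPR 2026 Foundation Models Meet Embodied Agents Workshop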

Abstract

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five different modality adapters of a shared VLM backbone and learns a neural routing policy that selects the best adapter based on the question during inference. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our methodology over the Open3D-VQA benchmark and our evaluations show that no single modality is universally optimal -- point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.
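The routing step and the oracle-agreement metric can be made concrete with a toy example. The adapter names and router scores are hypothetical, and the scores stand in for the output of the MLP described above; no sentence encoder or trained model is involved here.

```python
ADAPTERS = ["text", "rgb", "point_cloud", "depth", "pose"]  # hypothetical names

def route(scores):
    """Return the adapter with the highest router score for one question."""
    return max(ADAPTERS, key=lambda name: scores[name])

def oracle_agreement(router_choices, oracle_choices):
    """Fraction of questions where the router picks the oracle-best adapter."""
    matches = sum(r == o for r, o in zip(router_choices, oracle_choices))
    return matches / len(oracle_choices)

# Made-up router scores for four questions.
router_scores = [
    {"text": 0.1, "rgb": 0.2, "point_cloud": 0.5, "depth": 0.1, "pose": 0.1},
    {"text": 0.3, "rgb": 0.4, "point_cloud": 0.1, "depth": 0.1, "pose": 0.1},
    {"text": 0.1, "rgb": 0.1, "point_cloud": 0.2, "depth": 0.5, "pose": 0.1},
    {"text": 0.6, "rgb": 0.1, "point_cloud": 0.1, "depth": 0.1, "pose": 0.1},
]
oracle = ["point_cloud", "rgb", "point_cloud", "pose"]
choices = [route(s) for s in router_scores]
agreement = oracle_agreement(choices, oracle)
```

Because the router commits to one adapter before any adapter runs, inference needs a single adapter call per question rather than one call per modality.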

2606.02459 2026-06-02 cs.CV

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

Wei Deng, Xianlin Zhang, Mengshi Qi

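Affiliations: University of Science and Technology of China

AI summary: Proposes an agentic vision-language model pipeline inspired by pigeons' cognitive maps; a dynamic cognitive map and Spatial Assertion Codes provide dense reward signals, giving 80.5% overall accuracy on the MindCube benchmark and a 53.2% relative improvement on the Rotation subset.

Comments: Accepted by ICML 2026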

Abstract

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by pigeons' building and exploiting cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new dynamic cognitive map parameterizing scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose a novel Spatial Assertion Codes (SAC), Python expressions programmatically describing spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
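To make the idea of checking spatial assertions against a cognitive map concrete, here is a minimal sketch. The map layout, the helper predicates, and the reward definition (fraction of assertions that hold) are all assumptions for illustration; the authors' actual SAC format lives in their repository and may differ.

```python
import math

# Hypothetical cognitive map: object -> 2D position and heading.
cognitive_map = {
    "chair": {"pos": (0.0, 0.0), "yaw_deg": 90.0},
    "table": {"pos": (2.0, 0.0), "yaw_deg": 0.0},
    "lamp": {"pos": (2.0, 3.0), "yaw_deg": 180.0},
}

def left_of(m, a, b):
    """True if object a has a smaller x coordinate than object b."""
    return m[a]["pos"][0] < m[b]["pos"][0]

def distance(m, a, b):
    (ax, ay), (bx, by) = m[a]["pos"], m[b]["pos"]
    return math.hypot(ax - bx, ay - by)

# Intermediate reasoning steps expressed as checkable Python predicates.
assertions = [
    lambda m: left_of(m, "chair", "table"),
    lambda m: distance(m, "table", "lamp") == 3.0,
    lambda m: left_of(m, "lamp", "chair"),  # deliberately false
]

def dense_reward(m, checks):
    """Share of assertions that hold on the current map."""
    return sum(1 for check in checks if check(m)) / len(checks)

reward = dense_reward(cognitive_map, assertions)
```

Scoring each intermediate claim gives a learning signal even when the final answer is wrong, which is the contrast with sparse outcome-only rewards drawn in the abstract.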

2606.02455 2026-06-02 cs.LG cond-mat.mtrl-sci physics.chem-ph physics.comp-ph stat.CO

Speculative Sampling For Faster Molecular Dynamics

Arthur Kosmala, Stephan Günnemann, Meng Gao, Brandon Wood

Affiliations: FAIR at Meta; School of Computation, Information & Technology, Technical University of Munich; Munich Data Science Institute; Munich Center For Machine Learning

AI summary: Proposes Langevin Speculative Dynamics (LSD), a distributed, model-agnostic speculative sampling method in which a draft model quickly proposes steps and a target model verifies them in parallel, speeding up molecular dynamics simulation by 3-9x without adding relative error.

Comments: Forty-Third International Conference on Machine Learning (ICML 2026). 32 pages, 14 figures, 8 tables

Abstract

Molecular dynamics (MD) is a key tool for simulating the dynamical behavior of atomic systems. However, MD is inherently serial, which makes it difficult to increase single-system throughput with concurrent compute. To address this, we introduce Langevin Speculative Dynamics (LSD), a distributed and model-agnostic speculative sampler for accelerating MD without adding relative error. Inspired by speculative methods in language and diffusion modeling, LSD uses a draft model to propose fast simulation steps and verifies them in parallel with a slower target model, applying a transport map from the draft to the target distribution. We extend speculative sampling to second-order Langevin dynamics, derive the achievable speedup as a function of physical parameters, show that LSD generalizes across different systems and draft-target combinations with a 3-9x speedup, and confirm theoretically and empirically that LSD samples trajectories from its target model distribution.
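For readers unfamiliar with speculative sampling, the sketch below shows the standard accept/reject rule from language-model speculative decoding on a small discrete distribution. It is not LSD: the paper works with continuous Langevin dynamics and a transport map, and the distributions `p` and `q` here are made-up toy values.

```python
import random

def speculative_step(p_target, q_draft, rng):
    """Return one sample distributed as p_target, proposed from q_draft."""
    outcomes = list(q_draft)
    x = rng.choices(outcomes, weights=[q_draft[o] for o in outcomes])[0]
    # Accept the draft proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, p_target[o] - q_draft[o]) for o in outcomes]
    return rng.choices(outcomes, weights=residual)[0], False

p = {"a": 0.5, "b": 0.3, "c": 0.2}  # slow "target" distribution (toy)
q = {"a": 0.2, "b": 0.5, "c": 0.3}  # fast "draft" distribution (toy)

rng = random.Random(0)
sample, accepted = speculative_step(p, q, rng)
```

The accept rule plus the residual resampling is what keeps the output distributed exactly as the target, so a cheap draft can be used without biasing the samples.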

2606.02453 2026-06-02 cs.CV cs.AI

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

Xiang Li, Dianbo Liu, Kenji Kawaguchi

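Affiliations: University of California, Berkeley; University of Tokyo

AI summary: To counter mode collapse in generative models, proposes DivIn, which samples the initial noise from a guidance potential posterior and uses Langevin dynamics to steer initialization away from collapsing regions, improving diversity while remaining compatible with diffusion and flow matching models.

Comments: Accepted by ICML 2026 Spotlight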

Abstract

Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly focus on intervening during the generation trajectory. We identify a critical oversight that the standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate selecting the initial noise from a guidance potential posterior, which effectively re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering initial noise away from collapsing regions while anchoring them to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn exhibits a superior performance in both class-to-image and text-to-image scenarios. Furthermore, we highlight that as DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.
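Sampling initial noise from a prior reweighted by a potential can be sketched with unadjusted Langevin dynamics in one dimension. The quadratic potential U(x) = (x - target)^2 / 2 is a stand-in chosen so the answer is known in closed form; DivIn's actual guidance potential is not specified in the abstract.

```python
import math
import random

def grad_log_posterior(x, target):
    """Gradient of log[N(x; 0, 1) * exp(-U(x))] with U(x) = (x - target)^2 / 2."""
    return -x - (x - target)

def langevin_init(x0, target, step, n_steps, rng):
    """Unadjusted Langevin: x <- x + step * grad + sqrt(2 * step) * noise."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * grad_log_posterior(x, target) + math.sqrt(2 * step) * noise
        samples.append(x)
    return samples

samples = langevin_init(x0=0.0, target=3.0, step=0.01, n_steps=20000, rng=random.Random(0))
kept = samples[2000:]  # discard burn-in
mean_after_burn_in = sum(kept) / len(kept)
```

For this toy potential the reweighted density is Gaussian with mean 1.5, halfway between the prior mean and the potential's minimum, which shows how the prior anchors the samples while the potential shifts them.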

2606.02449 2026-06-02 cs.AI cs.CL cs.CV cs.LG cs.MM

HLL: Can Agents Cross Humanity's Last Line of Verification?

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu

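Affiliations: Shanghai Jiao Tong University; Shandong University; Tongji University

AI summary: Proposes the HLL benchmark, which uses interactive CAPTCHA verification to assess whether multimodal agents can stand in for humans in protected workflows, and finds that current agents are brittle in localization, action calibration, state tracking, and process consistency.

Comments: 27 pages, 14 figures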

Abstract

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

2606.02444 2026-06-02 cs.AI cs.CL

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar, Arabella Sinclair

发表机构 * University of Aberdeen（阿伯丁大学）； University of Colorado Anschutz（科罗拉多大学安舒茨分校）； Heriot-Watt University（赫里奥特-瓦特大学）； University College London（伦敦大学学院）

AI总结本研究通过与临床饮食障碍专家合作，系统评估了大型语言模型在处理饮食障碍用户查询时，因不加批判地适应不安全或自伤请求而可能产生的危害。

2606.02443 2026-06-02 cs.CL cs.AI cs.CV

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video：面向主动安全预警的流式视频基准

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Tsinghua University（清华大学）

AI总结提出PaSBench-Video基准，包含740个视频，评估多模态大模型在危险发生前及时发出预警的能力，发现现有模型在时序精度和低误报率上表现不佳。

AI中文摘要

从危险的第一个可见迹象到事故发生之间，通常存在一个仍可干预的时间窗口。具备视频能力的多模态大语言模型（MLLM）可以作为始终在线的安全监控器，在此窗口内发出警告。然而，当前的基准测试并未检验这一能力：它们依赖静态输入，忽略时间精度，并且省略了对安全场景的误报测量。我们提出了PaSBench-Video，一个包含740个视频的基准测试，涵盖驾驶、医疗、日常生活和工业生产四个领域，其中包含481个风险视频和259个无风险视频。风险视频标注了帧级别的风险起始点和事故边界。模型必须以因果方式观察视频，并发出在时间上校准且内容正确的警告。测试了13个MLLM后，我们发现没有模型在我们的最严格指标上超过20.0%，并且召回率与误报率紧密相关，皮尔逊相关系数为0.64：更高的检测率只能以在大多数安全片段上触发警告为代价。性能按领域显著分化：在日常生活领域，模型在低误报率下实现了中等召回率，因为该领域的风险本质上是异常的；而在驾驶领域，模型不加区分地触发警告，因为常规场景和危险场景看起来相似。这些结果表明，当前模型依赖于场景级别的活动线索，而不是推理正在出现的危害。

英文摘要

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
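
The recall versus false-positive trade-off described above can be made concrete with a small sketch. It assumes a warning counts as timely only if it fires at or after the annotated risk onset and before the accident frame, and that any warning on a no-risk clip is a false positive. The `Clip` fields and both function names are hypothetical, and the benchmark's actual metrics may be defined differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    """One evaluated video (field names are illustrative assumptions)."""
    warning_frame: Optional[int]     # frame where the model warned; None if it stayed silent
    onset: Optional[int] = None      # annotated risk onset; None for a no-risk clip
    accident: Optional[int] = None   # annotated accident frame; None for a no-risk clip


def is_timely(clip: Clip) -> bool:
    """True if the warning lands inside the [onset, accident) intervention window."""
    if clip.warning_frame is None or clip.onset is None or clip.accident is None:
        return False
    return clip.onset <= clip.warning_frame < clip.accident


def recall_and_fpr(clips: List[Clip]) -> Tuple[float, float]:
    """Recall over risk clips and false-positive rate over no-risk clips."""
    risk = [c for c in clips if c.onset is not None]
    safe = [c for c in clips if c.onset is None]
    if not risk or not safe:
        raise ValueError("need at least one risk clip and one no-risk clip")
    recall = sum(is_timely(c) for c in risk) / len(risk)
    fpr = sum(c.warning_frame is not None for c in safe) / len(safe)
    return recall, fpr
```

Under this definition, a model that warns on every clip reaches a false-positive rate of 1.0, which is why recall has to be read together with the false-positive rate.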

2606.02441 2026-06-02 cs.CV

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

空间-时间解耦参考条件用于身份保持的文本到视频生成

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Lizhuang Ma, Jiangning Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； University of Electronic Science and Technology of China（电子科技大学）； Zhejiang University（浙江大学）

AI总结提出ST-DRC框架，通过空间-时间解耦参考条件、TASS-RoPE机制和身份目标，实现高保真身份保持视频生成。

AI中文摘要

身份保持视频生成（IPVG）旨在合成高保真视频，遵循文本提示同时忠实保持参考身份。尽管最近取得进展，现有IPVG方法仍难以平衡高级语义控制和低级身份保真度。为弥合这一差距，我们提出ST-DRC，一种有效的空间-时间解耦参考条件框架，用于身份保持的文本到视频生成。在框架层面，ST-DRC通过使用视频VAE编码参考图像并将其与噪声视频潜在变量拼接，执行潜在上下文特征注入，无需额外适配器即可访问丰富的低级身份细节。为将身份感知参考检索与外观复制分离，我们引入TASS-RoPE，一种时间相邻-空间偏移的RoPE方案，将参考令牌在时间上靠近视频序列但在空间上偏移，允许参考信息通过时空注意力流动，同时抑制像素级复制粘贴捷径。为进一步防止捷径学习并增强扩散目标中被稀释的身份监督，我们结合外观不变参考增强与面部引导身份目标，鼓励模型在颜色、姿态和布局变化下保持身份。在推理时，我们引入三流参考无分类器引导策略，独立控制文本遵循度和参考保真度。实验表明，ST-DRC在基于LTX-2.3的轻量级设计下，实现了强身份保持、提示对齐、时间一致性和视频质量。我们的方法在面部身份保持视频生成赛道中排名靠前，验证了空间-时间解耦参考条件的有效性。

英文摘要

Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
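
The abstract does not give the formula for the three-stream guidance, so the following is only a sketch of one common way to compose two guidance scales from three denoiser outputs (unconditional, reference-only, and reference plus text). The function name, argument names, and the nesting order of the two conditions are assumptions, not the authors' method.

```python
from typing import List


def three_stream_cfg(
    eps_uncond: List[float],   # prediction with neither text nor reference
    eps_ref: List[float],      # prediction with the reference only
    eps_full: List[float],     # prediction with reference and text
    ref_scale: float,          # strength of reference fidelity
    text_scale: float,         # strength of text adherence
) -> List[float]:
    """Combine three noise predictions with two independent guidance scales."""
    return [
        u + ref_scale * (r - u) + text_scale * (f - r)
        for u, r, f in zip(eps_uncond, eps_ref, eps_full)
    ]
```

With both scales set to 1.0 the expression collapses to `eps_full`, so the scales can be read as independent amplifications of the reference direction and the text direction.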

2606.02436 2026-06-02 cs.CV

Geometry-Aware Implicit Memory for Video World Models

几何感知隐式记忆用于视频世界模型

Zhengxuan Wei, Xu Guo, Xinghui Li, Xunzhi Xiang, Min Wei, Yiran Zhu, Qiulin Wang, Xintao Wang, Pengfei Wan, Xiangwang Hou, Qi Fan

发表机构 * School of Intelligence Science and Technology, Nanjing University（南京大学智能科学与技术学院）； Kling Team, Kuaishou Technology（快手科技可灵团队）； Tsinghua University（清华大学）

AI总结提出GIM-World框架，通过轻量级Transformer编码器将可变长度历史压缩为固定大小的记忆令牌，并利用相机可查询的几何头在训练期间从冻结的基础模型中蒸馏3D场景结构，从而在长时程视频生成中保持几何和视觉一致性。

Comments Project page: https://gim-world.github.io/

AI中文摘要

视频世界模型旨在模拟可控的视觉环境，但长时程展开取决于模型在观察离开其原生上下文窗口后记住的内容。显式记忆保留帧或在线3D重建，可能会遭受启发式检索错误、冗余外观存储或重建伪影。隐式记忆将历史压缩为紧凑状态，但现有设计没有明确约束以编码跨视图场景几何。我们提出GIM-World，一种用于视频世界模型的几何感知隐式记忆框架。轻量级Transformer编码器将可变长度历史压缩为固定大小的记忆令牌，相机可查询的几何头在训练期间从冻结的基础模型中将3D场景结构蒸馏到记忆中，信息引导的剪枝规则在历史增长时保持编码成本有界。在推理时丢弃几何教师，留下轻量级记忆模块。在MIND上的实验表明，GIM-World在保持长时程几何和视觉一致性方面优于显式和隐式记忆基线。

英文摘要

Video world models aim to simulate controllable visual environments, but long-horizon rollouts depend on what the model remembers after observations leave its native context window. Explicit memories retain frames or online 3D reconstructions, which can suffer from heuristic retrieval errors, redundant appearance storage, or reconstruction artifacts. Implicit memories compress history into a compact state, but existing designs are not explicitly constrained to encode cross-view scene geometry. We propose GIM-World, a geometry-aware implicit memory framework for video world models. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, a camera-queryable geometry head distills 3D scene structure from a frozen foundation model into the memory during training, and an information-guided pruning rule keeps encoding cost bounded as history grows. The geometry teacher is discarded at inference, leaving a lightweight memory module. Experiments on MIND show that GIM-World better preserves long-horizon geometric and visual consistency than both explicit- and implicit-memory baselines.

2606.02434 2026-06-02 cs.AI

Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

通过输入二值化弥合半导体视觉程序合成中的仿真到现实差距

Yusuke Ohtsubo, Kota Dohi, Koichiro Yawata, Koki Takeshita, Tatsuya Sasaki

发表机构 * National Institute of Information and Communications Technology, Japan（日本情报通信研究机构）

AI总结提出一种视觉程序合成框架，利用输入二值化策略消除扫描电子显微镜图像的纹理和噪声，使视觉语言模型专注于几何结构，从而弥合仿真到现实的差距，在MIIC数据集上将平均Dice系数从0.4393提升至0.5256。

AI中文摘要

精确的电路几何参数控制对于半导体检测至关重要，但获取足够的真实训练数据成本高昂。尽管扩散模型和生成对抗网络等生成模型可以扩充训练数据，但它们无法保证计量任务所需的纳米级几何精度。我们提出一个视觉程序合成框架，其中视觉语言模型将检测图像转换为描述电路几何的可编辑领域特定语言代码，从而能够通过精确参数操作可控地生成训练数据。由于视觉语言模型仅使用合成的DSL渲染数据进行训练，在处理真实扫描电子显微镜图像时会出现领域差距。我们通过输入二值化策略弥合这一差距，该策略去除SEM特有的纹理和噪声，使模型专注于几何结构。在MIIC数据集上，二值化输入将平均Dice系数从原始输入基线的0.4393提升至0.5256，表明简单的纹理抽象显著缓解了仿真到现实的差距。

英文摘要

Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.
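
As a rough illustration of the two ingredients named in the abstract, the sketch below binarizes a grayscale image and scores two binary masks with the Dice coefficient, 2|A∩B| / (|A| + |B|). The fixed global threshold is an assumption made for brevity; the abstract does not say how the paper binarizes SEM images, and both function names are hypothetical.

```python
from typing import List

Image = List[List[int]]


def binarize(image: Image, threshold: int) -> Image:
    """Map each pixel to 1 if it is at or above the threshold, else 0."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def dice(a: Image, b: Image) -> float:
    """Dice coefficient between two binary masks of the same shape."""
    intersection = sum(x & y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    total = sum(map(sum, a)) + sum(map(sum, b))
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2 * intersection / total
```

Thresholding discards intensity variation inside each region, which is the sense in which binarization removes texture while keeping shape.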

2606.02432 2026-06-02 cs.RO

NDPP-Grasp: Non-Differentiable Physical Plausibility Constraint-Guided Task-Oriented Dexterous Grasp Generation

NDPP-Grasp：非可微物理合理性约束引导的任务导向灵巧抓取生成

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

发表机构 * Lancaster University（兰卡斯特大学）

AI总结提出一种框架，通过将非可微物理合理性约束直接注入任务对齐的抓取扩散模型的去噪过程，实现物理合理性引导的灵巧抓取生成，同时保持任务对齐。

AI中文摘要

任务导向的灵巧抓取生成旨在产生既物理合理又适用于特定操作任务的灵巧抓取姿态。现有的基于扩散的方法通常以解耦的方式处理这两个要求：它们首先训练一个用于任务对齐的抓取扩散模型，然后依赖生成后的细化来提高物理合理性。然而，这种事后修正策略仅在抓取已经生成后才应用物理合理性指导，使得生成轨迹本身不受物理约束引导，可能导致次优的抓取。为了解决这个问题，我们提出了一种新颖的框架，该框架以实用且有效的方式将物理合理性指导直接注入任务对齐的抓取扩散模型的去噪过程中，即使物理合理性约束是非可微的。这使得物理合理性能够在整个去噪过程中塑造抓取生成，同时保持任务对齐。大量实验证明了我们框架的有效性。

英文摘要

Task-oriented dexterous grasp generation aims to produce dexterous grasp poses that are both physically plausible and functionally suitable for specified manipulation tasks. Existing diffusion-based methods often address these two requirements in a decoupled manner: they first train a grasp diffusion model for task alignment and then rely on post-generation refinement to improve physical plausibility. However, this after-the-fact correction strategy applies physical plausibility guidance only once the grasp has already been generated, leaving the generation trajectory itself unguided by physical constraints and potentially leading to suboptimal grasps. To address this problem, we propose a novel framework that directly injects physical plausibility guidance into the denoising process of a task-aligned grasp diffusion model in a practical and effective manner, even when physical plausibility constraints are non-differentiable. This allows physical plausibility to shape grasp generation throughout denoising while preserving task alignment. Extensive experiments demonstrate the efficacy of our framework.

2606.02424 2026-06-02 cs.CV cs.AI cs.LG

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

GC-MoE：基因组引导的细胞类型特异性专家混合模型用于基于组织学的单细胞空间转录组学

Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita, Kazuya Nishimura, Andreas Dengel, Muhammad Nabeel Asim

发表机构 * Kyushu University（九州大学）； German Research Center for Artificial Intelligence (DFKI GmbH)（德国人工智能研究中心）； RPTU University Kaiserslautern-Landau（凯泽斯劳滕-兰道大学）； The University of Osaka（大阪大学）； IntelligentX GmbH； Osaka Metropolitan University（大阪公立大学）

AI总结提出GC-MoE模型，通过路由网络估计细胞类型概率并软组合细胞类型特异性专家，结合细胞类型特异性共表达感知预测器和细胞间交互注意力模块，从组织学图像和细胞位置预测单细胞基因表达，在公共数据集上优于现有方法。

AI中文摘要

基于组织学的单细胞空间转录组学（ST）估计旨在从组织病理学图像和细胞位置预测单个细胞的基因表达，从而减少对昂贵的单细胞ST测量的需求。现有的组织学到ST方法主要预测包含多个细胞的局部区域的斑点级谱；与之不同，该任务需要对细胞间的表达变异性进行建模，而这种变异性在很大程度上由细胞类型决定。我们提出了基因组引导的细胞类型特异性专家混合模型（GC-MoE），该模型通过路由网络估计细胞类型概率，并对细胞类型特异性专家进行软组合以预测基因表达。为了进一步编码细胞类型依赖的基因程序，我们引入了细胞类型特异性共表达感知预测器（CAP），以及一个用于建模邻近细胞上下文的轻量级细胞间交互注意力（C2CA）模块。在公共单细胞ST数据集上的实验和消融研究表明，该方法相较于现有的单细胞基线和经过适配的斑点级基线均有一致的改进。

英文摘要

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.
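
The soft combination of experts can be sketched as follows: a router produces one logit per cell type, a softmax turns the logits into probabilities, and the prediction is the probability-weighted sum of the per-expert outputs. This is a generic mixture-of-experts sketch, not the GC-MoE implementation; the function names and the plain-list representation are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(router_logits: List[float], expert_outputs: List[List[float]]) -> List[float]:
    """Weight each expert's per-gene prediction by its cell-type probability."""
    probs = softmax(router_logits)
    n_genes = len(expert_outputs[0])
    return [
        sum(p * output[g] for p, output in zip(probs, expert_outputs))
        for g in range(n_genes)
    ]
```

Because every expert contributes in proportion to its probability, a cell whose type is ambiguous receives a blended prediction rather than a hard assignment to one expert.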

2606.02423 2026-06-02 cs.CL cs.LG

Investigating and Alleviating Harm Amplification in LLM Interactions

调查和缓解大语言模型交互中的危害放大

Ruohao Guo, Wei Xu, Alan Ritter

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结提出HarmAmp基准和TrajSafe监控器，用于评估和缓解多轮对话中大语言模型对危害的放大效应。

AI中文摘要

大语言模型（LLM）可以作为有用的助手，但它们同样可能成为危害放大器，使恶意用户通过长时间交互实现超出其自身能力的有害结果。这种风险体现在两个方面：一是领域专业知识的普及化，使新手也能产生专业性的有害内容；二是以人工难以企及的规模扩大有害操作。然而，现有工作往往忽略了LLM在多轮对话中如何加剧危害。我们引入了HarmAmp，这是一个新的基准，涵盖十二个风险类别的多轮危害放大场景。每个场景都基于现实世界的威胁，并满足三项严格标准：实质性放大、操作具体性和多轮必要性。我们进一步提出了TrajSafe，一种主动监控器，可以预判有害轨迹，并通过探测用户真实意图、引导模型给出更安全的回复等行动进行干预。我们的大量实验表明，TrajSafe显著降低了多轮交互中产生的危害性，同时保持了较低的过度拒绝率和目标模型的通用能力。我们的工作为缓解LLM交互中微妙的安全风险提供了一个有前景的范式。

英文摘要

Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.

2606.02406 2026-06-02 cs.CV

Edge Prediction for Roof Wireframe Reconstruction with Transformers

基于Transformer的屋顶线框重建边预测

Gustav Hanning, Ludvig Dillén, Jonathan Astermark, Johanna Lidholm, Viktor Larsson

发表机构 * Centre for Mathematical Sciences, Lund University（隆德大学数学科学中心）

AI总结提出一种端到端Transformer编码器-解码器架构，利用稀疏SfM点云和语义分割图重建3D屋顶线框，在HoHo 22k数据集上取得0.6476的混合结构分数，位列挑战赛第二名。

Comments Presented at the 3rd Urban Scene Modeling (USM3D) Workshop at CVPR 2026

AI中文摘要

本文提出了一种针对S23DR Challenge 2026的有竞争力的解决方案，该挑战旨在从稀疏SfM点云以及地面视角的语义分割图和深度图中重建3D房屋屋顶线框模型。我们的方法采用受DETR启发的端到端Transformer编码器-解码器架构。为了有效处理几何和语义数据，稀疏SfM点云输入会基于语义优先级进行动态子采样，并以Gestalt和ADE20k类别特征进行增强。为了进一步增加分割上下文，我们将点特征与额外的Gestalt特征编码融合，这些编码通过将点投影到冻结自编码器产生的潜在特征图中获得。随后，学习到的查询嵌入通过交叉注意力机制直接解码为3D线框边。在“HoHo 22k”数据集上的评估表明，我们的方法显著优于手工设计和基于学习的基线方法，取得了0.6476的混合结构分数（HSS），并在挑战赛私有排行榜上位列第二。

英文摘要

This paper presents a competitive solution to the S23DR Challenge 2026, which aims to reconstruct 3D house roof wireframe models from sparse SfM point clouds and ground-level semantic segmentations and depth maps. Our proposed method utilizes an end-to-end Transformer encoder-decoder architecture inspired by DETR. To effectively process the geometric and semantic data, the sparse SfM point cloud input is dynamically subsampled based on semantic priority and augmented with Gestalt and ADE20k class features. To further increase segmentation context, we fuse the point features with additional Gestalt feature encodings which are obtained by projecting the points into latent feature maps produced by a frozen autoencoder. Learned query embeddings are then decoded directly into 3D wireframe edges via cross-attention mechanisms. Evaluated on the "HoHo 22k" dataset, our approach significantly outperforms both handcrafted and learned baselines, achieving a Hybrid Structure Score (HSS) of 0.6476 and securing the second-highest position on the challenge's private leaderboard.

2606.02404 2026-06-02 cs.CL

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

K-BrowseComp：基于韩国语境的网页浏览代理基准测试

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

发表机构 * Chung-Ang University（韩国中央大学）； KAIST（韩国科学技术院）； Seoul National University（首尔国立大学）； OnelineAI； NAVER Cloud AI； Carnegie Mellon University（卡内基梅隆大学）

AI总结针对韩国语境，构建包含400个问题的网页浏览代理基准K-BrowseComp，评估前沿模型性能，发现其准确率显著低于BrowseComp，并公开数据与代码。

AI中文摘要

前沿模型评估正从基础能力（如指令遵循和推理）转向组合性、代理性能力，但韩语代理基准仍然稀缺。我们介绍了K-BrowseComp，一个基于韩国语境的网页浏览代理基准，包含400个问题。其中包含300个问题的K-BrowseComp-Verified子集由韩语母语者手动构建和验证。在该子集上，包括GPT-5.5、DeepSeek-V4-Pro和GLM-5.1在内的前沿LLM仅达到30.00–45.67%，相比BrowseComp大幅下降，而通过韩国专有AI基础模型计划发布的韩国LLM仅获得0.00–10.33%。我们进一步利用解决网页浏览问题与创建此类问题之间的不对称性，通过高难度少样本示例和针对失败模式的生成，构建了一个包含100个问题的合成子集。在经过对抗性过滤的合成诊断子集上，最强模型仅达到26.00%，我们将该子集作为针对性压力测试单独报告。我们公开发布了数据和代码。

英文摘要

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.

2606.02398 2026-06-02 cs.LG cs.CL

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

多领域强化学习中跨域干扰与恢复的局部微扰理论

Lei Yang, Siyu Ding, Deyi Xiong

发表机构 * TJUNLP Lab, College of Intelligence and Computing, Tianjin University（天津大学智能与计算学院TJUNLP实验室）； Baidu Inc.（百度公司）

AI总结针对多领域RL中在某一领域训练会降低其他领域性能的问题，提出局部微扰理论，证明后期领域训练主要通过二阶损伤项在低维共享冲突子空间中损害早期领域，并通过短时领域刷新实现选择性恢复。

AI中文摘要

强化学习后训练在数学推理、代码生成、问答和创意写作（CW）等单个领域上改进了大型语言模型，但在一个领域上的训练往往会降低其他领域的性能。基于灾难性遗忘或全局梯度冲突的现有解释是不完整的：即使全模型梯度几乎正交，也可能发生实质性干扰。我们表明，单领域RL产生稀疏、小幅度的参数编辑，且变化最大的神经元之间重叠较弱，而不同领域仍然共享大量活跃的计算路径，这些路径上的更新方向决定了它们是协同还是冲突。在此观察指导下，我们在多领域RL的局部微扰模型下证明，后期领域训练主要通过二阶损伤项损害早期领域，在观察到的稀疏路径结构下，该损伤项集中在低维共享冲突子空间中。此外，短时领域刷新会收缩该子空间上的有害成分，从而在有限的附带损伤下实现选择性恢复。与理论一致，在Code → Math → QA → CW之后进行短暂的Re-Math刷新，将Math从57.66恢复到66.04，同时基本保持其他领域的性能，得到最佳平均分66.39。除了刷新之外，针对Math-QA对，在稀疏的代理冲突坐标集上进行无训练回滚可部分恢复Math，为局部化损伤提供了直接的代理级证据。这些结果为多领域RL中的干扰和恢复提供了局部化的机制性解释。

英文摘要

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
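
The role of a second-order term can be motivated with a standard Taylor expansion. This is a sketch under assumed notation, not the paper's derivation: $\mathcal{L}_A$ is the earlier domain's loss, $g_A$ and $H_A$ are its gradient and Hessian at the current parameters $\theta$, $g_B$ is the later domain's gradient, and a single later-domain step is modeled as $\delta = -\eta\, g_B$.

$$
\mathcal{L}_A(\theta + \delta) \approx \mathcal{L}_A(\theta) - \eta\, g_A^\top g_B + \frac{\eta^2}{2}\, g_B^\top H_A\, g_B
$$

When $g_A^\top g_B \approx 0$ (nearly orthogonal gradients), the first-order term vanishes and the change in $\mathcal{L}_A$ is governed by the quadratic term, which is consistent with the claim that interference can occur without global gradient conflict.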

2606.02388 2026-06-02 cs.LG cs.AI

Policy and World Modeling Co-Training for Language Agents

语言智能体的策略与世界模型协同训练

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang

发表机构 * Southern University of Science and Technology（南方科技大学）； Hong Kong University of Science and Technology（香港科技大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Hong Kong Polytechnic University（香港理工大学）； LIGHTSPEED

AI总结提出PaW框架，通过在强化学习过程中添加辅助世界模型监督，无需改变推理范式，提升语言智能体在多个任务上的性能。

Comments 9 pages, 6 figures

AI中文摘要

强化学习通过教导大语言模型智能体哪些行动能带来高奖励来改进它们，但对于这些行动会给环境带来什么影响，它提供的监督很少。世界建模可以填补这一空白，但现有方法通常需要单独的模拟器、额外的训练阶段或额外的推理时计算。我们观察到，同策略（on-policy）强化学习的rollout已经包含了所需的信号：每个转移都将一个行动与其产生的下一个观察配对。基于这一观察，我们提出了PaW，一个策略与世界模型协同训练框架，它在强化学习过程中为同一策略添加辅助的世界模型监督，而不改变推理范式。为了使辅助世界模型监督信息丰富且稳定，PaW引入了三个组件：基于行动熵的世界模型数据选择、噪声容忍的世界模型损失和奖励自适应的损失平衡。在三个智能体任务基准上的实验表明，在不同模型和强化学习算法上，PaW相对于强大的强化学习基线均有一致的改进。这些结果表明，标准的强化学习rollout是语言智能体训练中世界模型监督的实用来源。

英文摘要

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
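
The observation that each rollout transition already pairs an action with its next observation can be sketched as a data-extraction step plus a weighted auxiliary loss. The prompt template, the `Transition` fields, and the fixed `wm_weight` are assumptions for illustration; the abstract's entropy-based selection, noise-tolerant loss, and reward-adaptive balancing are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    """One step of an on-policy rollout (field names are illustrative)."""
    observation: str
    action: str
    next_observation: str


def world_model_examples(rollout: List[Transition]) -> List[Tuple[str, str]]:
    """Turn each transition into a (prompt, target) pair for next-observation prediction."""
    examples = []
    for step in rollout:
        prompt = (
            f"Observation: {step.observation}\n"
            f"Action: {step.action}\n"
            "Next observation:"
        )
        examples.append((prompt, step.next_observation))
    return examples


def total_loss(rl_loss: float, wm_loss: float, wm_weight: float) -> float:
    """Add the auxiliary world-modeling loss to the RL loss with a scalar weight."""
    return rl_loss + wm_weight * wm_loss
```

Since the pairs come from rollouts the policy has already generated, this step needs no separate simulator and leaves inference unchanged.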

2606.02381 2026-06-02 cs.AI cs.LG math.DS

A Mathematical Conflict Framework for Contextual Data Modulation

上下文数据调制的数学冲突框架

Hakan Emre Kartal

发表机构 * GitHub

AI总结提出一个基于算子的数学冲突框架，将冲突视为局部、方向性和上下文敏感的量，通过统一抽象算子整合权重、尺度行为和输出映射，作为独立于优化过程的数学对象。

Comments 15 pages, 3 figures, framework paper

2606.02380 2026-06-02 cs.CL cs.AI

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

SPADE-Bench：通过计划-行动分歧评估智能体中的自发性策略欺骗

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

发表机构 * Beijing Academy of Artificial Intelligence（北京智源人工智能研究院）； Peking University（北京大学）； University of Science and Technology of China（中国科学技术大学）； University of Chinese Academy of Sciences（中国科学院大学）； Alibaba Group（阿里巴巴集团）

AI总结针对LLM智能体在工具使用中可能出现的自发性策略欺骗（计划与行动不一致），提出SPADE-Bench基准，通过结合实际工具执行和受控压力场景，严格区分欺骗与幻觉，实验证实该问题真实且紧迫。

AI中文摘要

随着基于LLM的智能体扩展其操作范围，可靠性成为实际部署的前提。然而，在实际应用中，人类用户无法监控每一个即时行为；相反，执行过程往往是一个黑箱，用户仅依赖智能体的自我报告更新。这种不透明性带来了关键风险：智能体可能呈现与执行行动不一致的面向观察者的报告，使得系统不可控，尤其是在高风险自主场景中。我们将这种自我报告的计划-行动分歧称为智能体欺骗。为了评估这一点，我们引入了SPADE-Bench，一个旨在评估自发性计划-行动分歧的基准。与先前的欺骗基准不同，SPADE-Bench同时集成了实际工具执行和受控压力场景。这种设计确保了生态效度，并通过在压力下进行受控的计划-行动比较，严格区分策略欺骗与单纯的幻觉。跨主流模型的实验证实，智能体欺骗在工具使用环境中是一个真实且紧迫的问题。通过提供一个全面且稳健的评估框架，SPADE-Bench填补了智能体安全中的关键空白，促进社区朝着构建可信和可控的自主系统迈进。

英文摘要

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

2606.02379 2026-06-02 cs.CV

Honey, I Shrunk the Arc de Triomphe!

亲爱的，我把凯旋门缩小了！

Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, Noah Snavely

发表机构 * Cornell University（康奈尔大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结针对单目度量几何估计中的“尺度坍缩”现象，通过构建新数据集MetricScenes并采用两阶段泊松补全方法提升深度图质量，微调MoGe-2模型显著缓解了尺度低估问题。

Comments Project page: https://metricscenes.github.io/

AI中文摘要

度量尺度单目几何估计通过大规模数据聚合取得了显著进展，但当前的基础模型存在持续的“尺度坍缩”现象：远处地标和广阔景观的度量尺度被低估。我们假设这一性能差距源于训练数据瓶颈：现有度量尺度数据集受硬件限制，要么是同质化的车载LiDAR采集数据或短距离室内扫描，要么是缺乏物理世界语义复杂性的合成数据。为弥补这一差距，我们整理了一个新的具有度量基准的野外数据集MetricScenes，其数据来源多样，包括互联网照片集和立体图像。我们使用现成方法估计每个场景的相机姿态和初始深度图，并从地理标记元数据以及已知的立体相机基线恢复绝对尺度。我们还通过一种新的两阶段泊松补全方法改进了从MetricScenes导出的深度图质量。在我们的数据集上微调MoGe-2显著缓解了尺度坍缩，并在无约束的开放域场景中实现了更优的度量精度，同时在标准基准上保持了最先进的性能。

英文摘要

Metric scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogenous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically-grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.
