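arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.

2606.08402 2026-06-17 cs.CV cs.AI cs.MA New submission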

SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

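Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

Affiliations: Nanyang Technological University; University of Oxford; Meshy AI

AI summary: Proposes a multi-agent orchestration framework that decomposes single-image 3D scene generation into three stages (scene initialization, environment construction, and multi-agent refinement) and introduces a geometry-aware layout predictor, outperforming existing methods in geometric accuracy, spatial consistency, and perceptual realism.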

Abstract

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

2606.07555 2026-06-17 cs.CL cs.LG New submission

Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override

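Han-yu Wang

Affiliations: The University of Hong Kong

AI summary: Using Stroop-paradigm experiments, finds that lexical priors in language models persist after a local rule overrides them, and uses activation patching to localize a source-position triplet, showing that the prior is the shared channel for both the origin of interference and the trace left by the override.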

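AI-generated abstract (translated from Chinese)

Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the lexical prior persists through override rather than replacement: it keeps operating after the local rule is applied, and the rule lowers its logit rather than installing a new meaning on top. We test this with a Stroop-style paradigm: a remapping rule ("doctor" means "forest") is pitted against the query word's lexical-prior distractor ("hospital"), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five aligned models locates a source-position triplet (definition subject, definition target, query word) that nearly fully recovers the conflict effect (aggregate $R \in [0.92, 1.06]$). A definition-target swap shows that the triplet performs binding rather than identity matching. Dissociation experiments isolate target preservation as the binding-specific signature: distractor suppression occurs under matched, swap, and item-mismatched conditions alike, whereas target logit collapse occurs only when the definition-target position is corrupted. Behavior and mechanism converge on the same channel: the lexical prior is both where interference originates and where the override leaves its trace.

Abstract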

Glossaries, technical specifications, and system prompts routinely ask language models to use familiar words in unfamiliar ways. When this works, the local rule does not install the new meaning on top of the old one; the pretrained prior keeps operating underneath, and its strength still shows through. We test this with a Stroop-style paradigm: a remapping rule (doctor means forest) pitted against the query word's lexical-prior distractor (hospital), with matched neutral controls. Across 11 open-weight models spanning four families and 1B-9B parameters, lexical-prior strength predicts interference even after item-level controls for answer prior, frequency, tokenization, and prompt wording. Activation patching on five aligned models locates a source-position triplet (definition subject, definition target, query word) that nearly fully recovers the conflict effect (aggregate $R \in [0.92, 1.06]$); a definition-target swap shows the triplet performs binding rather than identity matching. Dissociation experiments isolate target preservation as the binding-specific signature: distractor suppression occurs under matched, swap, and item-mismatched conditions alike, whereas target logit collapse occurs only when the definition-target position is corrupted. Behavior and mechanism converge on the same channel: the prior's strength both predicts which overrides fail and marks where the causal repair lands.

2606.06523 2026-06-17 cs.AI cs.LG cs.LO cs.SE New submission

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

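Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang

Affiliations: University of Illinois Urbana-Champaign; Independent researcher

AI summary: Proposes the Lean4Agent framework, which uses the dependently typed formal language Lean4 to formally model and verify agent workflows and improves workflow reliability through the FormalAgentLib library and the LeanEvolve method; in experiments, workflows that pass verification outperform failing ones by 11.94% on average.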

Abstract

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, which is, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. **Lean4Agent** introduces **FormalAgentLib**, an extensible Lean4 library for formally modeling and verifying the semantic consistency of agent workflows under explicit assumptions, and for localizing execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FLs to formally model and verify agent behavior.

2606.10376 2026-06-17 cs.AI cs.IT math.IT Cross-listing

Belief-Space Control for Personalized Cancer Treatment via Active Inference

Deniz Sargun, H. Bugra Tulay, C. Emre Koksal

Affiliations: American Association for Cancer Research; AACR Project GENIE registry; AACR Project GENIE Biopharma Collaborative

AI summary: Uses active inference to model cancer treatment as a belief-space planning problem, unifying goal-directed control and information acquisition under a measurement budget to achieve patient classification and efficient treatment.

Comments 11 pages including appendix

2605.05172 2026-06-17 cs.RO cs.AI Updated version

When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

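Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng

Affiliations: Rai-Inst

AI summary: Proposes the Q2RL algorithm, which extracts a Q-function from a behavior cloning policy and uses Q-gating to switch between policies, enabling efficient offline-to-online reinforcement learning and reaching success rates of up to 100% and up to a 3.75x improvement on robot manipulation tasks.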

Comments Robotics: Science and Systems, 2026

Abstract

Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL, Q-Estimation and Q-Gating from BC for Reinforcement Learning, an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for RL policy training. Across manipulation tasks from D4RL and robomimic benchmarks, Q2RL outperforms SOTA offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich and high precision manipulation tasks such as pipe assembly and kitting, in 1-2 hours of online interaction, achieving success rates of up to 100% and up to 3.75x improvement against the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
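
The Q-gating step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `q_gate`, the use of a single shared critic `q_fn`, the tie-breaking rule, and the toy critic are all hypothetical.

```python
from typing import Any, Callable, Tuple


def q_gate(state: Any, bc_action: Any, rl_action: Any,
           q_fn: Callable[[Any, Any], float]) -> Tuple[Any, str]:
    """Pick whichever candidate action the critic scores higher.

    q_fn is a hypothetical critic Q(state, action). The abstract does not
    say whether one critic or two are used, so a single one is assumed.
    """
    bc_value = q_fn(state, bc_action)
    rl_value = q_fn(state, rl_action)
    # Assumption: ties fall back to the BC action.
    if rl_value > bc_value:
        return rl_action, "rl"
    return bc_action, "bc"


def toy_q(state: float, action: float) -> float:
    """Toy critic: actions closer to the state value score higher."""
    return -(action - state) ** 2


print(q_gate(1.0, 0.5, 0.9, toy_q))
print(q_gate(1.0, 1.0, 0.2, toy_q))
```

In an online loop, the chosen action would be executed and the resulting transition stored for RL policy training, which is how the abstract describes the role of gating.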

2605.01973 2026-06-17 cs.CL cs.LG Updated version

Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

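Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li

Affiliations: University of Science and Technology of China

AI summary: Proposes a hypernetwork-driven meta-gating mechanism that adapts an LLM to different textual conditions by dynamically adjusting the β parameter in SwiGLU blocks, outperforming fine-tuning and meta-learning baselines.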

Comments Accepted by ICML2026

2606.09337 2026-06-17 cs.RO Updated version

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Huaihang Zheng, Yi Yang, Kai Ma, Shenglin Xu, Tian Xie, Guozheng Li, Xiangyu Wang, Yiren Ma, Si Liu, Yinian Mao, Baoxu Liu

Affiliations: Meituan; Beijing Institute of Technology; Beihang University; State Key Lab of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS; China University of Mining and Technology (Beijing)

AI summary: Proposes the TORL-VLA framework, which combines tactile feedback with online reinforcement learning: a tactile-derived wrench-aware VLA predicts reference actions and a lightweight online RL module refines them, addressing policy adaptation when contact conditions shift and improving success rates and execution efficiency on long-horizon contact-rich tasks.

Comments Project page: https://torl-vla.github.io/

Abstract

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines. Project page: https://torl-vla.github.io/

2606.04513 2026-06-17 cs.AI Updated version

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Deguo Xia, Zihan Li, Haochen Zhao, Dong Xie, Yuyao Kong, Xiyan Liu, Jizhou Huang, Mengmeng Yang, Diange Yang

Affiliations: Tsinghua University; Baidu; University of Macau; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: Proposes the MapAgent framework, which combines a vision-language model with constraint-aware reasoning to correct specification violations in lane-map generation within a verification-driven Judge-Planner-Worker loop, enabling highly automated city-scale production.

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3818443

Abstract

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.
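
The control flow the abstract describes (selective triggering on low-confidence tiles, then a bounded loop of diagnosis, minimal edits, and re-validation) might look roughly like the sketch below. Everything here is hypothetical: `refine_tile`, the `threshold` and `max_rounds` parameters, and the toy judge, planner, and worker are stand-ins, not MapAgent's actual interfaces.

```python
def refine_tile(draft, confidence, judge, planner, worker,
                threshold=0.8, max_rounds=3):
    """Run a bounded judge/plan/edit loop on one map tile.

    Returns (final_draft, rounds_used). Tiles whose backbone confidence is
    at or above the threshold are passed through untouched.
    """
    if confidence >= threshold:
        return draft, 0
    issues = judge(draft)
    rounds = 0
    while issues and rounds < max_rounds:
        edits = planner(draft, issues)   # propose minimal corrective edits
        draft = worker(draft, edits)     # apply them deterministically
        issues = judge(draft)            # post-edit re-validation
        rounds += 1
    return draft, rounds


# Toy stand-ins: the "specification" is simply that no value is negative.
def toy_judge(draft):
    return [i for i, value in enumerate(draft) if value < 0]


def toy_planner(draft, issues):
    return [(issues[0], 0)]  # fix one violation per round


def toy_worker(draft, edits):
    fixed = list(draft)
    for index, value in edits:
        fixed[index] = value
    return fixed


print(refine_tile([1, -2, 3, -4], 0.3, toy_judge, toy_planner, toy_worker))
```

Capping the number of rounds keeps per-tile cost predictable, which matches the abstract's emphasis on a bounded loop and preserved throughput.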

2606.03609 2026-06-17 cs.RO cs.LG Updated version

A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature

Xuhui Lin, Stephen Law, Nanjiang Chen, Kunyao Li, Tao Yang

Affiliations: The Bartlett School of Sustainable Construction, University College London, UK; Department of Geography, University College London, UK; School of Project Management, Faculty of Engineering, The University of Sydney, AU; School of Engineering, Cardiff University, UK; School of Architecture, Tsinghua University, Beijing, CN

AI summary: Proposes an embodied world model that predicts 3D isovists (spherical visibility-depth maps), trained with depth residuals and self-rollout scheduled sampling, and finds that a cross-city spatial signature is linearly decodable from its temporal latents.

Abstract

Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.
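
To make the idea of a spherical visibility-depth map concrete, here is a small ray-casting sketch. It is an illustration only: spheres stand in for buildings, and the function names, the latitude-longitude sampling, and the `max_range` fill value are assumptions rather than details from the paper.

```python
import math


def isovist_depth(origin, spheres, n_az=8, n_el=4, max_range=100.0):
    """Return an n_el x n_az grid of distances to the nearest surface.

    spheres is a list of (center, radius) pairs acting as obstacles; rays
    that hit nothing get max_range. Rows are elevations, columns azimuths.
    """
    depth = []
    for i in range(n_el):
        el = -math.pi / 2 + (i + 0.5) * math.pi / n_el
        row = []
        for j in range(n_az):
            az = 2 * math.pi * j / n_az
            direction = (math.cos(el) * math.cos(az),
                         math.cos(el) * math.sin(az),
                         math.sin(el))
            nearest = max_range
            for center, radius in spheres:
                hit = ray_sphere(origin, direction, center, radius)
                if hit is not None and hit < nearest:
                    nearest = hit
            row.append(nearest)
        depth.append(row)
    return depth


def ray_sphere(origin, direction, center, radius):
    """Distance along a unit-length ray to a sphere, or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t >= 0:
            return t
    return None


grid = isovist_depth((0.0, 0.0, 0.0), [((5.0, 0.0, 0.0), 1.0)], n_az=4, n_el=3)
print(grid[1][0])  # horizontal ray pointing at the sphere
```

Under the residual formulation mentioned in the abstract, a model would predict the change to such a grid after a movement action rather than the grid itself; that part is not shown here.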

2606.03177 2026-06-17 cs.RO Updated version

ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control

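Yutong Liang, Quanquan Peng, Ri-Zhao Qiu, Xiaolong Wang

Affiliations: University of California San Diego

AI summary: Proposes ConTrack, a reinforcement-learning framework that treats object tracking as a constraint and adapts the task-style trade-off through dual-variable updates, combined with an adaptive mid-trajectory reset library, to track long-horizon, contact-rich hand motion, significantly improving success rate and object pose accuracy in simulation and on real robots.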

Abstract

Human demonstrations provide strong priors for robot manipulation, yet it is non-trivial to transfer them to execute on real robots due to the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted rewards that require per-sequence tuning and break under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task-style trade-offs online using a dual-variable update. ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulation tracking and on a real robot demonstrate that ConTrack improves success and object pose accuracy significantly over prior art while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.
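
A dual-variable update of the kind the abstract mentions is commonly written as projected dual ascent on a Lagrange multiplier. The sketch below shows that generic textbook form, which may differ from ConTrack's actual rule; the names `dual_update` and `combined_reward`, the `tolerance` budget, and the learning rate are all assumptions.

```python
def dual_update(multiplier, tracking_error, tolerance, lr=0.1):
    """One step of projected dual ascent.

    The multiplier grows while the tracking constraint is violated
    (tracking_error > tolerance) and shrinks toward zero once it holds.
    """
    return max(0.0, multiplier + lr * (tracking_error - tolerance))


def combined_reward(style_reward, tracking_error, multiplier):
    """Lagrangian-style reward: motion fidelity minus weighted tracking cost."""
    return style_reward - multiplier * tracking_error


multiplier = 0.0
for error in [0.5, 0.5, 0.4, 0.1, 0.1]:  # toy per-iteration tracking errors
    multiplier = dual_update(multiplier, error, tolerance=0.2)
    print(round(multiplier, 3), round(combined_reward(1.0, error, multiplier), 3))
```

The point of the design is that the weight on tracking is adjusted online from measured constraint violation instead of being hand-tuned per sequence.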

2606.03089 2026-06-17 cs.LG cs.AI Updated version

Constitutional On-Policy Safe Distillation

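Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng, Zhuoer Xu, Yuhao Sun, Shiwen Cui, Xiang Zheng, Guoyu Wang, Xingjun Ma, Yu-Gang Jiang

Affiliations: Institute of Trustworthy Embodied AI; Fudan University; Shanghai Innovation Institute; Ant Group; Zhejiang University; City University of Hong Kong

AI summary: Addresses the problem that, in safety alignment, constitutional conditioning contracts the teacher distribution in on-policy self-distillation and reduces expressiveness. Proposes Constitutional On-Policy Safe Distillation (COPSD), which calibrates the teacher distribution with a Cross-SFT cold start and then performs constitution-conditioned on-policy distillation, achieving a better safety-helpfulness trade-off and a lower safety tax on 12 benchmarks.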

Abstract

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
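
The asymmetry between forward and reverse KL is easy to see on a toy example. The sketch below assumes the usual distillation convention in which reverse KL means KL(student || teacher); the distributions are made up for illustration and have nothing to do with the paper's experiments.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# A teacher contracted onto one response type, and a broader student.
teacher = [0.9, 0.05, 0.05]
student = [0.4, 0.3, 0.3]

forward = kl(teacher, student)   # KL(teacher || student)
reverse = kl(student, teacher)   # KL(student || teacher)
print(round(forward, 4), round(reverse, 4))
```

Reverse KL penalizes the student heavily for keeping mass where the teacher has little, so minimizing it pulls a broad student toward the teacher's narrow mode. That is one way to read the abstract's remark that Reverse KL amplifies the contraction.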

2606.00588 2026-06-17 cs.CV Updated version

Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting

Phuoc-Nguyen Bui, Van-Vi Vo, Duc-Tai Le, Junghyun Bum, Van-Nguyen Pham, Ki-Young Kim, Seung-Young Yu, Hyunseung Choo

Affiliations: Research Convergence Institute; Sungkyunkwan University; Dept. of AI Systems Engineering; Dept. of Ophthalmology; Kyung Hee University Medical Center; Dept. of Electrical and Computer Engineering

AI summary: Proposes the ReVA framework, which fuses baseline and month-1 OCT images with tabular data to forecast the visual acuity trajectory of diabetic macular edema patients 3 to 24 months after anti-VEGF treatment.

Comments Accepted to MICCAI 2026

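AI-generated abstract (translated from Chinese)

Long-term visual acuity (VA) outcomes after anti-VEGF therapy are essential for counseling, expectation setting, and follow-up planning for patients with diabetic macular edema (DME). In clinical practice, however, physicians usually estimate long-term VA trajectories from early post-treatment findings alone, which makes reliable prognosis difficult. While prior OCT-based learning methods mainly focus on short-term response or single-endpoint prediction, modeling VA trajectories at multiple future time points from early longitudinal observations remains under-explored. In this study, we collected a real-world cohort of 188 DME patients treated with anti-VEGF therapy, with paired baseline and month-1 OCT scans as well as tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem that aims to predict visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that integrates structural features from baseline and month-1 OCT with tabular variables to capture baseline disease status and early treatment response. ReVA uses spatial attention to preserve local prognostic imaging features and a dependency-aware tabular encoder to model interactions among clinical variables. These multimodal representations are fused to predict patient-specific long-term VA trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction and performs consistently across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term VA forecasting and supports data-driven decision support in routine anti-VEGF management.

Abstract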

Long-term visual acuity (VA) forecasting after anti-VEGF therapy is important for counseling and follow-up planning in diabetic macular edema (DME), yet remains challenging when only early post-treatment findings are available. While prior OCT-based methods mainly focus on short-term response or single-endpoint prediction, multi-horizon VA forecasting from early longitudinal data remains under-explored. In this study, we assembled a real-world cohort of 188 anti-VEGF-treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that combines baseline and month-1 OCT features with tabular variables to capture disease status and early treatment response. ReVA integrates spatial OCT attention, dependency-aware tabular encoding, and cross-modal fusion to predict patient-specific long-term VA trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management. Code and pretrained models will be released at https://github.com/nguyenpbui/ReVA.
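
For readers unfamiliar with the three reported metrics, the sketch below computes MAE, RMSE, and R^2 using their standard definitions. The function name and the toy numbers are illustrative assumptions; they are not the paper's data or evaluation code.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Toy values only; a real evaluation would use held-out patient outcomes.
mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(mae, round(rmse, 4), r2)
```

MAE and RMSE are in the units of the VA measurement, while R^2 is unitless and compares the model against always predicting the mean outcome.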

2606.00024 2026-06-17 cs.CL Updated version

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

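Chen Qiu, Guozhong Li, Cristian McGee, Aritra Dutta, Panos Kalnis

Affiliations: King Abdullah University of Science and Technology; University of Central Florida

AI summary: Proposes Attention Run-time Termination (ART), a mechanism that tracks the accumulated attention output and terminates subsequent KV block accesses once their contribution becomes negligible, raising large-batch generation throughput by 20% without significantly affecting accuracy.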

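AI-generated abstract (translated from Chinese)

Long-context decoding in large language models (LLMs) is severely constrained by the memory bandwidth needed to fetch the large key-value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, despite evidence that attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks the accumulated attention output during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV-cache management methods, so it integrates with them seamlessly. Experiments on the LongBench benchmark show that, compared with state-of-the-art baselines, ART achieves 20% higher generation throughput at large batch sizes while maintaining comparable accuracy.

Abstract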

Long-context decoding in Large Language Models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, since incorporating values incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. Rather than replacing KV selection, ART dynamically terminates redundant KV traversal on top of existing dense or sparse attention policies. We introduce a stability-based criterion that monitors both magnitude and directional changes of intermediate attention outputs and provide a theoretical characterization of the resulting truncation error. Experiments on the LongBench and RULER Needle-in-a-Haystack tasks show that ART increases the generation throughput of existing KV-cache methods by up to 20%, without compromising the result quality.
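
The mechanism can be illustrated with blockwise softmax attention for a single query that stops reading KV blocks once the running output stops changing. This is a plain-Python sketch under several assumptions: the block order, the thresholds `mag_tol` and `cos_tol`, and the specific stability test (relative norm change plus cosine distance between consecutive outputs) are illustrative choices, since the abstract does not spell out the paper's exact criterion.

```python
import math


def is_stable(prev, cur, mag_tol, cos_tol):
    """True if magnitude and direction barely changed between two outputs."""
    norm_prev = math.sqrt(sum(x * x for x in prev))
    norm_cur = math.sqrt(sum(x * x for x in cur))
    if norm_prev == 0.0 or norm_cur == 0.0:
        return False
    cosine = sum(a * b for a, b in zip(prev, cur)) / (norm_prev * norm_cur)
    magnitude_ok = abs(norm_cur - norm_prev) / norm_prev < mag_tol
    return magnitude_ok and (1.0 - cosine) < cos_tol


def attend_with_early_stop(q, keys, values, block=2, mag_tol=1e-3, cos_tol=1e-4):
    """Return (output, blocks_read) for one query over a non-empty KV list."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    prev = None
    blocks_read = 0
    for start in range(0, len(keys), block):
        block_keys = keys[start:start + block]
        block_values = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in block_keys]
        new_max = max(running_max, max(scores))
        # Online softmax: rescale earlier sums to the new running maximum.
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for score, value in zip(scores, block_values):
            weight = math.exp(score - new_max)
            denom += weight
            acc = [a + weight * x for a, x in zip(acc, value)]
        running_max = new_max
        blocks_read += 1
        out = [a / denom for a in acc]
        if prev is not None and is_stable(prev, out, mag_tol, cos_tol):
            break  # remaining blocks are assumed to contribute negligibly
        prev = out
    return out, blocks_read


q = [10.0, 0.0]
keys = [[5.0, 0.0], [5.0, 0.0]] + [[-5.0, 0.0]] * 4
values = [[1.0, 0.0], [1.0, 0.0]] + [[0.0, 1.0]] * 4
print(attend_with_early_stop(q, keys, values)[1])
```

Skipping later blocks is only safe if they really do contribute little, which is why the abstract pairs the criterion with a characterization of the truncation error. This sketch makes no such guarantee.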

2605.31286 2026-06-17 cs.RO cs.AI Updated version

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

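Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu

Affiliations: Tongji University

AI summary: Proposes DeMaVLA, a model that pairs a VLM backbone with an action expert and generates continuous actions with flow matching, prunes transformer layers for efficiency, and trains on large-scale real-world data plus data aggregated from human feedback, generalizing deformable-object folding across categories.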

Comments 14 pages, 2 figures

Abstract

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation (DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
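
Flow-matching action generation is usually run at inference time by integrating a learned velocity field from a noise sample to an action. The sketch below shows that sampling loop with a simple Euler integrator. It is a generic illustration, not DeMaVLA's code: `sample_action`, the step count, and the hand-written toy velocity field (which stands in for a trained network) are assumptions.

```python
def sample_action(velocity, noise, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy field that heads straight for a fixed target action.

    A trained model would instead condition on observations and language.
    """
    def velocity(x, t):
        return [(goal - xi) / (1.0 - t) for goal, xi in zip(target, x)]
    return velocity


action = sample_action(make_toy_velocity([0.3, -0.2]), noise=[1.0, 1.0], steps=10)
print([round(a, 6) for a in action])
```

Fewer integration steps mean fewer forward passes of the action expert per generated action, which is one reason this formulation is attractive for real-time control.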

2605.27023 2026-06-17 cs.AI Updated version

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

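Yinan Liu, Wenjin Xu, Zhiyuan Zha, Xiaochun Yang, Bin Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes KMAS, an adaptive negative sampling method that dynamically adjusts the ratio of hard negative triples to improve knowledge graph foundation models on zero-shot completion.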

Abstract

Knowledge graphs (KGs) have become the backbone of numerous downstream tasks such as question answering and recommender systems. Nevertheless, KGs are often highly incomplete. KG foundation models (KGFMs) have received wide attention for zero-shot knowledge graph completion in unseen KGs, whose relational vocabularies differ from those used for pre-training. Existing KGFMs are often trained with random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often of limited quality and provide weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples using the updated relation embeddings generated by the existing KGFM's relation encoder. To further align with the evolving capability of the KGFM during training, KMAS adjusts the ratio of hard negative triples dynamically throughout the training process: after a warmup phase, it increases the ratio linearly and then decreases it linearly. Extensive experiments are conducted on 44 datasets. Experimental results demonstrate that our proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory consumption.
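
The ratio schedule described in this abstract (warmup, then a linear increase followed by a linear decrease) can be sketched as a piecewise-linear function. This is only an illustration: the function name, the zero ratio during warmup, the peak step, and the `max_ratio` value are all assumptions, since the abstract does not give KMAS's actual hyperparameters.

```python
def hard_negative_ratio(step, total_steps, warmup_steps, peak_step, max_ratio=0.5):
    """Fraction of hard negatives to use at a given training step.

    Hypothetical schedule: 0 during warmup, then a linear ramp up to
    max_ratio at peak_step, then a linear decay back to 0 at total_steps.
    """
    if not 0 <= warmup_steps < peak_step < total_steps:
        raise ValueError("expected 0 <= warmup_steps < peak_step < total_steps")
    if step < warmup_steps:
        # Warmup: random negatives only (an assumption of this sketch).
        return 0.0
    if step <= peak_step:
        # Linear increase from 0 at warmup_steps to max_ratio at peak_step.
        return max_ratio * (step - warmup_steps) / (peak_step - warmup_steps)
    # Linear decrease from max_ratio at peak_step to 0 at total_steps.
    return max_ratio * max(0.0, (total_steps - step) / (total_steps - peak_step))
```

In a training loop, a value such as `hard_negative_ratio(step, 100, 10, 50)` would decide how many of the negatives in each batch are drawn from the hard pool rather than sampled at random.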

2605.26921 2026-06-17 cs.CV q-bio.NC (updated version)

Similarity-based representation factorization for revealing interpretable dimensions in representational data

揭示大脑、行为和AI中表征的核心维度

Florian P. Mahner, Ka Chun Lam, Francisco Pereira, Martin N. Hebart

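Affiliations: Max Planck Institute for Human Cognitive and Brain Sciences; National Institute of Mental Health; Justus Liebig University Giessen; Center for Mind, Brain and Behavior

AI summary: Proposes Similarity-Based Representation Factorization (SRF), which recovers low-dimensional, non-negative, interpretable embeddings from similarity matrices to reveal the latent dimensions of representations in neural, behavioral, and computational data.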

AI中文摘要

表征研究广泛存在于神经科学、心理学和人工智能等领域。虽然通常通过刺激之间的相似性来研究和比较表征，但现有方法仅能有限地访问塑造这些表征的维度，且可解释性有限。为克服这些挑战，本文引入相似性基表示因子分解（SRF），一种通用的计算方法，用于从测量数据导出的相似性矩阵中恢复低维、非负、可解释的嵌入。在模拟以及多种神经、行为和计算数据集中，SRF能从各种形式的表征数据中恢复可解释的维度，即使对于非常稀疏采样、不完整的数据也是如此。从这些数据集中导出的维度与任务特定模型获得的维度相匹配，预测独立的行为属性，改进探索性分析，并且与比较相似性矩阵相比，为验证性假设检验提供更高的统计功效。这些结果共同确立了SRF作为一种通用方法，在揭示、理解和利用表征背后的维度方面具有广泛的应用前景。

英文摘要

The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.

2605.30036 2026-06-17 cs.AI cs.CL (updated version)

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

向机器传授价值观：在LLMs中模拟类人行为

Asaf Yehudai, Naama Rozen, Ariel Gera

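Affiliations: The Hebrew University of Jerusalem; IBM Research; Tel-Aviv University

AI summary: Grounded in psychological value theory, this study runs large-scale experiments (over 5 million questions) to assess how well value-prompted LLMs match humans in value structure and value-behavior relationships, and shows that incorporating human value distributions improves population-level simulation.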

Comments We had some disagreement regarding proper attribution; we hope to resolve it soon and upload the paper

AI中文摘要

大型语言模型（LLMs）展示了采用不同角色和身份的能力；然而，它们是否能表现出符合连贯、类人价值结构的行为仍不清楚。在这项工作中，我们借鉴既定的心理学价值理论，在LLMs中诱导类人价值观，并评估它们与人类研究中观察到的模式的一致性。使用经过验证的心理学问卷，我们进行了大规模实验——超过500万个问题——以评估领先LLMs的价值结构和价值-行为关系，并将其与人类进行比较。我们的发现揭示了价值提示的LLMs与人类在两个维度上的强烈一致性。此外，引入人类价值分布增强了价值诱导LLMs的群体模拟。这些发现凸显了价值诱导LLMs作为有效的、基于心理学的模拟人类行为工具的潜力。

英文摘要

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

2603.02803 2026-06-17 cs.CV (updated version)

Structure-Aware Text Recognition for Ancient Greek Critical Editions

面向古希腊校勘本的结构感知文本识别

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

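Affiliations: Inria

AI summary: Builds a large-scale synthetic corpus and a benchmark of real scans to evaluate vision-language models on structure-aware text recognition, finding that Qwen3VL-8B reaches a median character error rate of 1.0% on real scans.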

AI中文摘要

视觉语言模型（VLM）的最新进展已经改变了端到端的文档理解。然而，它们解释历史学术文本复杂布局语义的能力仍然有限。本文研究了面向古希腊校勘本的结构感知文本识别，这些校勘本具有密集的参考层次和广泛的边缘注释。我们引入了两个新资源：（i）从TEI/XML源生成的185,000页图像的大规模合成语料库，具有受控的排版和布局变化，以及（ii）跨越一个多世纪编辑和排版实践的真实扫描校勘本的精选基准。使用这些数据集，我们在零样本和微调设置下评估了三种最先进的VLM。我们的实验揭示了当前VLM架构在面对高度结构化的历史文档时的显著局限性。在零样本设置中，大多数模型的性能明显低于现有的现成软件。尽管如此，Qwen3VL-8B模型达到了最先进的性能，在真实扫描上实现了1.0%的中位字符错误率。这些结果既突显了当前VLM在结构感知识别复杂学术文档方面的不足，也展示了其未来潜力。

英文摘要

Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
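
A character error rate such as the median reported in this abstract is conventionally the Levenshtein edit distance between the recognized text and the reference, divided by the reference length. The sketch below assumes that convention and a per-page median; the function names are illustrative, and the abstract does not specify the paper's exact normalization or text preprocessing.

```python
from statistics import median


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def median_cer(pairs):
    """Median CER over (reference, hypothesis) pairs, e.g. one pair per page."""
    return median(cer(ref, hyp) for ref, hyp in pairs)
```

Because insertions count as errors, a CER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.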

2605.29563 2026-06-17 cs.AI cs.CV cs.RO (updated version)

Planning with the Views

通过场景自我探索进行视图规划

Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

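Affiliations: Northwestern University; University of Washington; Microsoft; University of Oxford; Stanford University

AI summary: Introduces the ViewSuite benchmark, which exposes VLM weaknesses in multi-step view planning, and an iterative framework of self-exploration and view graph distillation that raises Qwen2.5-VL-7B's interactive view planning accuracy from 2.5% to 47.8%.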

AI中文摘要

VLM能否预测每个相机移动如何改变视图，并提前规划许多这样的移动？我们称这种能力为视图规划，需要(1)理解单个动作如何变换视图，以及(2)在多步规划中组合许多这样的变换以识别目标视图。我们在提出的ViewSuite中探测了这两种能力，ViewSuite是一个基于真实ScanNet场景的3D点云环境。在13个前沿VLM中，出现了一个关键的规划差距：它们具备基本的视图-动作知识，但无法在多步规划中组合这些知识，并且随着视点距离的增加，差距扩大。为了缩小这一差距，我们提出了一个迭代框架，交替进行自我探索和视图图蒸馏。关键洞察是，所有探索轨迹，无论其结果如何，共同形成一个视图图，紧凑地捕捉了场景中视点如何连接。将这个图蒸馏到多样化的监督任务中，重塑了策略分布，并克服了使纯RL停滞的稀疏奖励。这将Qwen2.5-VL-7B在交互式视图规划上的准确率从2.5%提升到47.8%，超过了GPT-5.4 Pro（18.5%）和Gemini 3.1 Pro（21.4%）。自我探索成为VLM在3D空间中主动推理和规划的一条有前景的路径。

英文摘要

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning. It requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and data are available at https://viewsuite.github.io.

2605.25652 2026-06-17 cs.CL cs.CY (updated version)

A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

LLM评审员与律师协会考官对泰国律师资格考试自由回答论文的两阶段稳定性研究

Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong

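Affiliations: VISAI AI

AI summary: Using free-form essays from the Thai bar examination, studies asymmetries in scoring agreement between LLM judges and human examiners, finding that LLM judges lean toward the majority human reading and fail to reproduce the minority one.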

AI中文摘要

NLP中的自由形式法律论文评估将专家间评分者稳定性视为单一上限数字，并将LLM评审员与该上限的一致性视为评审员稳定性的证据。我们通过相同输入协议在泰国律师资格考试上检验这两个假设：三名律师协会培训的考官（A、B、C）和一个26个LLM评审员小组对来自相同四个输入（问题、官方律师协会评分规定、标准答案、考生答案）的15个交叉评分的答案进行评分。主要发现是不对称的。在评分标准规定两个轴的15个单元格中的10个上，所有29名评分者收敛在一个狭窄的区间内：小组一致性是普遍的。在其余5个单元格中，评分标准未规定如何评分一个正确但省略了决定性法定引用的最终答案，人类小组在两个连贯的解读之间分裂（B/C多数在评分标准上限区间，分数6-8；A少数在较低区间，分数1-2）。LLM评审员群体并不对称分裂：26个LLM中有22个在或接近B/C的有争议区间评分，3个位于规定沉默的中间间隙，只有1个（GPT-5.4 Nano）接近A的区间但未一致地在其内评分。我们26个评审员小组中的零个LLM在有争议的单元格上复制了少数人类阅读。B/C方向的集群跨越了我们测试的每个模型大小、供应商和价格层级。一个仪器化的三个LLM锚定子小组（Claude 4.6 Opus、Gemini 3.1 Pro、GPT-5.4 Pro）携带确定性探针、输入消融和自助法置信区间，并在15个单元格上达到锚定小组α=0.77，而人类小组α=0.36。高LLM小组α反映了系统性地收敛于多数阅读，而不是平衡地复制两种阅读；一个通过最大化与人类参考小组的一致性来选择其LLM评审员的基准将必然继承这种不对称性。

英文摘要

Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score 6-8; A minority at the lower band, score 1-2). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells. The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $α= 0.77$ on the 15 cells against human-panel $α= 0.36$. The high LLM-panel $α$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.

2605.24003 2026-06-17 cs.CV cs.AI stat.AP (updated version)

Remote sensing data imputation using deep learning for multispectral imagery

基于深度学习的多光谱遥感数据插补

Shuang Liu, Fiona Johnson, Rohitash Chandra

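Affiliations: Water Research Centre, University of New South Wales; ARC ITTC Data Analytics for Resources and Environments, University of New South Wales; Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, University of New South Wales

AI summary: To address missing optical satellite data caused by cloud cover, this study compares linear interpolation with several deep learning models (CNN, Inception Resnet, Autoencoder, and their LSTM combinations) for reconstructing missing spectral bands at four lakes with a history of algal blooms. The deep learning models substantially outperform the baseline, CNN performs best, and algal bloom indices computed from the imputed imagery agree well with observations.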

AI中文摘要

近年来，遥感技术在水体应用中得到越来越多的利用。使用光学卫星数据的一个常见挑战是由于云覆盖导致的观测缺失。这些数据缺口可能导致错过对水资源管理部门高度关注的湖泊中关键事件（如藻华）的检测。因此，提高光学卫星数据集的完整性对于改善藻华的监测和预测至关重要。在本研究中，我们比较了传统数据插补方法（即线性插值）与深度学习模型在四个有藻华历史记录的湖泊中重建缺失光谱波段的效果。采用的深度学习模型包括基于CNN的架构（即CNN、Inception Resnet和Autoencoder）以及基于CNN-LSTM的架构（即CNN-LSTM、Resnet-LSTM和Autoencoder-LSTM）。我们的结果表明，在人工掩膜区域内插补光谱波段值时，深度学习模型显著优于基线线性插值方法。在这些模型中，CNN在大多数湖泊中表现最佳。此外，我们通过将插补图像与观测数据进行比较，评估了基于插补图像的藻华指数（即Green/Red和NDCI）的性能。我们的结果表明，深度学习模型对于插补PlanetScope SuperDove影像中的缺失数据是有效的，从而能够实现更可靠的水体监测应用。

英文摘要

Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
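
NDCI is commonly defined as the normalized difference between red-edge and red reflectance, and the Green/Red index as a simple band ratio. The following is a minimal per-pixel sketch under those common definitions; the function names are illustrative, and the mapping to specific PlanetScope SuperDove bands is an assumption the abstract does not spell out.

```python
def ndci(red_edge, red):
    """Normalized Difference Chlorophyll Index for one pixel.

    Common definition: (red_edge - red) / (red_edge + red).
    Returns NaN when both reflectances are zero.
    """
    denom = red_edge + red
    if denom == 0:
        return float("nan")
    return (red_edge - red) / denom


def green_red_ratio(green, red):
    """Green/Red band ratio for one pixel; NaN when red reflectance is zero."""
    if red == 0:
        return float("nan")
    return green / red
```

Comparing these indices between imputed and observed pixels, as the abstract describes, would then reduce to applying the same functions to both images and measuring the difference.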

2602.10635 2026-06-17 cs.AI cs.LG (updated version)

OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

OmniSapiens: 一种通过异质性感知相对策略优化进行社会行为处理的基础模型

Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang

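Affiliations: Massachusetts Institute of Technology; National University of Singapore; Nanyang Technological University; Prince Sattam bin Abdulaziz University; University of Rochester

AI summary: To address the training imbalance caused by heterogeneous behavioral data, proposes the Omnisapiens-7B 2.0 foundation model, trained with Heterogeneity-Aware Relative Policy Optimization (HARPO), which achieves the best performance on 10 behavioral tasks and 5 zero-shot generalization benchmarks.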

Comments Accepted to ICML 2026 Main Conference

AI中文摘要

社交智能AI系统必须能够推理多样的人类行为任务，并泛化到新情境。然而，AI尚未达到这种社交智能水平。现有模型仍然受到行为数据训练引起的学习动态不平衡的根本限制。即，行为数据本质上是异质的，包含多种模态和预测目标，通常在不同样本间产生不均匀的训练信号。为了解决这个问题，我们开发了Omnisapiens-7B 2.0，一个专门处理异质行为数据学习的社会行为处理基础模型。这是通过异质性感知相对策略优化（HARPO）实现的，这是一种新颖的推理强化学习方法，明确地重新平衡样本间的学习信号。核心思想是近似策略更新的贡献信号，利用它们进行几何中心化和惯性平滑的优势调节。结果表明，Omnisapiens-7B 2.0在10个不同的行为任务上取得了最佳且最一致的性能，同时在所有五个保留的零样本泛化基准上也取得了最佳性能，分别提升了高达+12.02%和+9.37%。此外，Omnisapiens-7B 2.0展示了更一致和可解释的推理轨迹，支持可靠的现实世界行为应用。我们的模型和代码可在https://github.com/MIT-MI/human_behavior_atlas找到。

英文摘要

Socially intelligent AI systems must reason across diverse human behavioral tasks and generalize to new social contexts. However, behavioral data is inherently heterogeneous, comprising diverse modalities and prediction targets that produce uneven training signals across samples, creating imbalanced learning dynamics that challenge existing AI models. To address this, we develop Omnisapiens-7B 2.0, a foundation model for social behavior processing that explicitly addresses learning from heterogeneous behavioral data. This is enabled through Heterogeneity-Aware Relative Policy Optimization, a new RL method that rebalances learning signals across samples by approximating each sample's contribution to the policy update and using these estimates to drive geometrically centered, inertially smoothed advantage modulation for stable training. Omnisapiens-7B 2.0 achieves the best and most consistent performance across 10 behavioral tasks, while also attaining the best performance on all five held-out benchmarks, with gains of up to +12.02% and +9.37% respectively. Furthermore, it demonstrates more consistent and interpretable reasoning traces, supporting reliable real-world behavioral applications. Our model is available at https://github.com/MIT-MI/human_behavior_atlas.

2605.23733 2026-06-17 cs.RO cs.AI (updated version)

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

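Affiliations: LimX Dynamics

AI summary: Proposes the Any2Any paradigm, which uses kinematic alignment and dynamics fine-tuning to transfer pretrained whole-body tracking models efficiently to new humanoid embodiments, reaching competitive tracking performance with only a small amount of data and compute.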

AI中文摘要

全身跟踪（WBT）模型已成为人形机器人的关键基础，使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算，使得在新人形平台上快速部署成本高昂。这自然引发一个问题：预训练的WBT模型能否通过最小化适应跨本体迁移？为回答这个问题，我们提出Any2Any，一种范式，能够高效地将现有WBT专家迁移到新人形本体，仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐，对齐其输入和输出空间，使得预训练的源策略可以在目标本体上有意义地重用。然后，Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调（PEFT）组件进行动力学适应，保留有用的行为先验，同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明，与从头训练相比，Any2Any显著加速收敛并降低训练成本，同时实现具有竞争力或更优的跟踪性能。值得注意的是，仅使用完整训练所需计算和数据的1%，Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明，预训练的WBT专家可以跨本体高效重用，为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots.

2605.23176 2026-06-17 cs.CV (updated version)

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

DRIVESPATIAL：自动驾驶中视觉语言模型时空智能的基准

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

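Affiliations: University of Arkansas, USA; Google Research, Google; University of Liverpool, UK; Max Planck Research School for Intelligent Systems

AI summary: Proposes the DriveSpatial benchmark, which uses multi-view spatiotemporal reasoning tasks to evaluate vision-language models on scene construction, relational understanding, temporal reasoning, and generalization in autonomous driving, and finds a substantial gap between humans and models.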

AI中文摘要

自动驾驶中的时空智能要求智能体将多视角观测整合为连贯的场景表示，跨视角和时间保持物体连续性，并推理空间关系、交互和未来动态。然而，现有的自动驾驶视觉语言基准主要关注单视角、静态、自我中心或单源问答，尚不清楚当前视觉语言模型（VLM）能否真正构建和推理动态驾驶场景。我们引入了DriveSpatial，一个包含来自五个大规模自动驾驶数据集的20个任务、15.6K人工验证问答对的基准。DriveSpatial评估四种能力：认知场景构建、多视角关系理解、时序推理和泛化。与之前的基准不同，DriveSpatial是从一个动态多关系场景图生成的，该图编码了物体状态、空间关系、交互、相机可见性和时间对应关系，从而产生强制进行真正的跨视角和时空推理的问答对。评估15个代表性VLM揭示了显著的人机差距：最强模型落后人类28.4分，其中认知场景构建成为关键瓶颈。进一步诊断表明，仅语言提示不足，而显式BEV基础一致地提升性能。这些结果表明，当前VLM缺乏可靠的时空驾驶智能所需的场景构建能力。DriveSpatial及其构建流程将发布以支持未来研究。

英文摘要

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.

2605.21135 2026-06-17 cs.CL (updated version)

Smarter edits? Post-editing with error highlights and translation suggestions

更智能的编辑？基于错误高亮和翻译建议的后编辑

Fleur V. J. van Tellingen, Gautam Ranka, Dora Žugčić, Joyce van der Wal, Andrea Camasta, Livio Guerra, Alina Karakanta

Affiliations: Leiden University Centre for Linguistics; Visvesvaraya National Institute of Technology; Department of Bionanoscience, Faculty of Applied Sciences, Delft University of Technology; Pedagogical Sciences, Leiden University; Faculty of Science, Leiden University

AI summary: Studies the effectiveness of error highlights and correction suggestions based on automatic post-editing (APE) in post-editing tasks, finding that they do not improve productivity or quality but do improve user experience.

Comments Accepted at EAMT 2026

2605.20708 2026-06-17 cs.CV cs.AI (updated version)

Rethinking Cross-Layer Information Routing in Diffusion Transformers

重新思考扩散变换器中的跨层信息路由

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

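Affiliations: Nanjing University; Alibaba Group; Zhejiang University; City University of Hong Kong

AI summary: Studies cross-layer information flow in diffusion transformers. Through systematic empirical analysis it identifies three concrete symptoms of conventional residual addition and proposes Diffusion-Adaptive Routing (DAR), which performs learnable, timestep-adaptive, non-incremental aggregation of sublayer outputs to improve model performance.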

AI中文摘要

扩散变换器（DiTs）已成为现代视觉生成的事实性骨干，其设计的几乎所有主要轴线——分词、注意力、条件、目标和潜在自编码器——都已被广泛重新审视。然而，决定信息如何在层之间积累的残差流却直接继承自原始Transformer。在本文中，我们对DiTs中的跨层信息流进行了系统性的实证分析，同时考虑深度和去噪时间步，并识别出传统残差加法的三个具体症状，即单调的前向幅度膨胀、急剧的反向梯度衰减和显著的块状冗余。受此诊断的启发，我们提出了扩散适应性路由（DAR），一种可直接替换残差的机制，能够对子层输出的历史进行可学习、时间步适应和非递增的聚合。此外，所提出的DAR与许多现代Transformer增强方法，如REPA，具有兼容性。在ImageNet 256×256上，DAR将SiT-XL/2的FID值提升了2.11（7.56 vs. 9.67），并且在8.75倍更少的训练迭代中达到了基线的收敛质量。在REPA之上堆叠时，它在早期阶段实现了2倍的训练加速，表明跨层信息路由是扩散建模中一个未被充分探索的设计轴，该轴与现有表示对齐目标相互独立。除了预训练外，DAR还可以在大规模T2I模型的微调阶段应用，并在分布匹配蒸馏中保留高频细节。

英文摘要

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

2510.21583 2026-06-17 cs.CV cs.AI (updated version)

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

基于流匹配的原理化强化学习从片段级策略优化中涌现

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

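Affiliations: GitHub

AI summary: Proposes GCPO, a reinforcement learning method for flow matching based on chunk-level policy optimization. By grouping consecutive steps into coherent chunks and shifting the level at which the policy is optimized, it mitigates inaccurate advantage attribution; experiments show it outperforms existing methods on text-to-image generation.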

Comments ICML 2026

2605.15980 2026-06-17 cs.CV (updated version)

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Flash-GRPO：通过单步策略优化实现视频扩散的高效对齐

Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

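Affiliations: Zhejiang University; Joy Future Academy; Independent Researcher; Tsinghua University

AI summary: Proposes Flash-GRPO, a single-step training framework that uses iso-temporal grouping and temporal gradient rectification to remove the computational bottleneck, achieving better alignment quality and training efficiency than full-trajectory training under low compute budgets.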

AI中文摘要

群体相对策略优化已成为将视频扩散模型与人类偏好对齐的关键，但面临一个关键的计算瓶颈：训练一个14B参数的模型通常每个实验需要数百个GPU天。现有的效率方法通过滑动窗口子采样训练时间步来降低成本，但从根本上损害了优化，表现出严重的不稳定性，并且无法达到完整的轨迹性能。我们提出了Flash-GRPO，一个单步训练框架，在低计算预算下在对齐质量上优于全轨迹训练，同时大幅提高了训练效率。Flash-GRPO解决了两个关键挑战：等时分组通过强制提示级别的时间一致性消除了时间步混淆的方差，将策略性能与时间步难度解耦；时间梯度校正中和了导致不同时间步梯度幅度极不一致的时间依赖缩放因子。在1.3B到14B参数模型上的实验验证了Flash-GRPO的有效性，展示了显著的训练加速，同时保持了一致的稳定性和最先进的对齐质量。

英文摘要

Group Relative Policy Optimization (GRPO) has emerged as essential for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but this fundamentally compromises optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
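
For context, standard GRPO scores each sample relative to the other samples generated for the same prompt. The sketch below shows only that group-relative advantage under the common mean-and-standard-deviation normalization; it does not implement Flash-GRPO's iso-temporal grouping or temporal gradient rectification, and the function name and `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sample group.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is used here; implementations
    differ on that choice, so treat it as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples that score above their group's mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all-zero advantages and therefore no learning signal.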

2506.13127 2026-06-17 cs.SD eess.AS (updated version)

Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

利用局部和全局知识整合与时间频率校准蒸馏进行语音增强

Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

Affiliations: School of Computer Science, Nanjing Audit University; School of Communication Engineering, Nanjing Institute of Technology; School of Information Science and Engineering, Southeast University; Cardiff University; Inner Mongolia University; CHI – the Chair of Health Informatics, TUM University Hospital; GLAM – the Group on Language, Audio, & Music, Imperial College London; Xiaomi EV

AI summary: Proposes a fusion framework that improves speech enhancement through time-frequency calibrated knowledge distillation, combining local information focusing with global knowledge circulation to improve low-complexity student models.

Comments submitted to IEEE Transactions on Cognitive and Developmental Systems

AI中文摘要

本文提出了一种内集和外集递归融合框架，结合时间频率校准知识蒸馏（I$^2$SRF-TFCKD）用于语音增强。与以往的语音增强蒸馏策略不同，该框架充分利用了语音的时间频率差异信息，同时促进局部信息聚焦和全局知识流通。首先，我们构建了内集和外集的相关蒸馏范式。在相关集合内，多层教师-学生特征进行成对匹配以实现校准蒸馏。随后，通过递归融合生成每个相关集合的代表性特征，形成融合特征集以促进跨集知识交互。其次，我们提出了一种基于双流时间频率交叉校准的多层交互蒸馏，分别在时间和频率域内计算教师-学生相似性校准权重，并进行交叉加权，从而根据语音特性对不同层的蒸馏贡献进行精细化分配。所提出的蒸馏策略应用于在L3DAS23挑战赛语音增强赛道排名第一的双路径扩张卷积循环网络（DPDCRN）。为了评估I$^2$SRF-TFCKD的有效性，我们在单通道和多通道语音增强数据集上进行了实验。客观评估显示，所提出的KD策略一致且有效地提升了低复杂度学生模型的性能，并优于其他蒸馏方案。

英文摘要

In this paper, we propose an intra-set and inter-set recursive fusion framework with time-frequency calibrated knowledge distillation (I$^2$SRF-TFCKD) for speech enhancement (SE). Different from previous distillation strategies for SE, the proposed framework fully exploits the time-frequency differential information of speech while facilitating both local information focusing and global knowledge circulation. First, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through recursive fusion to form the fused feature set that enables inter-set knowledge interaction. Second, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$SRF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed knowledge distillation (KD) strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.


2605.12646 2026-06-17 cs.LG cs.AI cs.HC updated version

Learning to Decide with AI Assistance under Human-Alignment

在人工智能协助下的人类对齐决策学习

Nina Corvelo Benz, Eleni Straitouri, Manuel Gomez-Rodriguez

Affiliations * GitHub

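AI summary: This paper studies how AI can assist decision-makers in high-stakes domains by predicting outcomes, and examines how the degree of alignment between the AI's prediction confidence and the decision-maker's own confidence affects the complexity of learning to make decisions.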


AI Chinese abstract

人们普遍认为，当人工智能模型通过预测感兴趣的结果来协助决策者时，它们应传达预测的置信度。然而，实证证据表明，决策者往往难以仅根据传达的置信度来判断何时信任预测。在此背景下，近期的理论和实证工作表明，AI辅助决策的效用与AI置信度和决策者自身置信度之间的对齐程度之间存在正相关性。关键的是，这些发现尚未阐明这种对齐程度如何影响通过重复交互学习做出最佳决策的复杂性。在本文中，我们考虑二元预测和二元决策的典型情况，首先证明该问题等价于具有完全反馈的双臂在线上下文学习问题，并建立了任何学习者可以达到的期望遗憾的下界为$Ω(\sqrt{|H| \cdot |B| \cdot T} )$，其中$H$和$B$分别表示人类和AI置信度的集合。然后我们证明，在AI和人类置信度完全对齐的情况下，学习者可以达到期望遗憾为$O(\sqrt{|H| \cdot T\log T})$，当$\sqrt{|H|} = O(\log T)$且$B$是可数的时，Dvoretzky-Kiefer-Wolfowitz不等式的非平凡推广将遗憾界改进到$O(\sqrt{T\log T})$。这些结果表明，对齐可以减少在人工智能协助下学习决策的复杂性。在两个不同的人类主体研究中，参与者通过AI模型协助解决简单决策任务的实验证明，我们的理论结果在完全对齐被违反时仍然稳健。

English abstract

It is widely agreed that when AI models assist decision-makers in high-stakes domains by predicting an outcome of interest, they should communicate the confidence of their predictions. However, empirical evidence suggests that decision-makers often struggle to determine when to trust a prediction based solely on this communicated confidence. In this context, recent theoretical and empirical work suggests a positive correlation between the utility of AI-assisted decision-making and the degree of alignment between the AI confidence and the decision-makers' confidence in their own predictions. Crucially, these findings do not yet elucidate the extent to which this alignment influences the complexity of learning to make optimal decisions through repeated interactions. In this paper, we address this question in the canonical case of binary predictions and binary decisions. We first show that this problem is equivalent to a two-armed online contextual learning problem with full feedback, and establish a lower bound of $\Omega(\sqrt{|H| \cdot |B| \cdot T})$ on the expected regret any learner can attain, where $H$ and $B$ denote the sets of human and AI confidence values. We then demonstrate that, under perfect alignment between AI and human confidence, a learner can attain an expected regret of $O(\sqrt{|H| \cdot T\log T})$ and, when $\sqrt{|H|} = O(\log T)$ and $B$ is countable, a non-trivial generalization of the Dvoretzky-Kiefer-Wolfowitz inequality improves the regret bound to $O(\sqrt{T\log T})$. Taken together, these results reveal that alignment can reduce the complexity of learning to make decisions with AI assistance. Experiments on real data from two different human-subject studies where participants solve simple decision-making tasks assisted by AI models show that our theoretical results are robust to violations of perfect alignment.
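As a rough numerical illustration of the regret rates quoted in the abstract, the sketch below evaluates each expression with every hidden constant set to 1. That is an assumption: the abstract gives only asymptotic orders, so the numbers indicate relative scaling, not actual regret. The function names and the sample values of $|H|$, $|B|$, and $T$ are hypothetical.

```python
import math

def general_lower_rate(h_size, b_size, horizon):
    """sqrt(|H| * |B| * T): order of the lower bound without alignment."""
    return math.sqrt(h_size * b_size * horizon)

def aligned_upper_rate(h_size, horizon):
    """sqrt(|H| * T * log T): order of the upper bound under perfect alignment."""
    return math.sqrt(h_size * horizon * math.log(horizon))

def improved_aligned_rate(horizon):
    """sqrt(T * log T): order of the improved bound, which the abstract states
    for the case sqrt(|H|) = O(log T) with B countable."""
    return math.sqrt(horizon * math.log(horizon))

def compare_rates(h_size, b_size, horizon):
    """Return the three rates (unit constants assumed) for one setting."""
    return {
        "general_lower": general_lower_rate(h_size, b_size, horizon),
        "aligned_upper": aligned_upper_rate(h_size, horizon),
        "improved_aligned": improved_aligned_rate(horizon),
    }

if __name__ == "__main__":
    # Hypothetical setting: 4 human confidence levels, 100 AI confidence
    # levels, and 10,000 rounds of interaction.
    for name, value in compare_rates(4, 100, 10_000).items():
        print(f"{name}: {value:.1f}")
```

With unit constants, the aligned rate falls below the general lower-bound rate exactly when $\log T < |B|$, which matches the abstract's reading that alignment removes the dependence on the number of AI confidence values.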
