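arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.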

2605.20853 2026-05-21 cs.SD eess.AS

SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

Affiliations: Faculty of Computer Science and Information Technology, Universiti Malaya; Faculty of Electrical Engineering, Universiti Teknologi Malaysia

AI summary: This paper presents the SEABAD dataset, which targets the species richness and acoustic complexity that make bird activity detection hard in tropical regions. Balanced bird-present and bird-absent samples and a standardized audio format support efficient acoustic monitoring and low-power inference.

Comments 14 pages, 4 figures

Abstract

Passive acoustic monitoring (PAM) enables large-scale biodiversity assessment, but continuous recording generates large amounts of non-informative audio, creating challenges for storage, power consumption, and long-term edge deployment. Bird audio detection (BAD), which identifies bird vocalizations, can reduce this burden by filtering irrelevant recordings before downstream analysis. However, most BAD systems are trained on temperate datasets despite tropical soundscapes being denser, more species-rich, and acoustically unpredictable. To address this gap, we introduce SEABAD (Southeast Asian Bird Activity Detection), a dataset of 50,000 curated three-second clips from Southeast Asian soundscapes, evenly balanced between bird-present and bird-absent samples. The dataset spans 1,677 bird species and is standardized to 16 kHz mono audio for embedded and low-power inference. We developed a dual-branch curation pipeline: a six-stage positive-label workflow applied to Xeno-Canto recordings, alongside six source-specific negative-label extractions from environmental datasets. These procedures reduced class imbalance by 13.7% (Gini coefficient: 0.601 to 0.519). A manual audit of 1,000 positive clips confirmed 97.8% +/- 0.9% labeling accuracy. Baseline experiments using MobileNetV3-Small achieved 99.57% +/- 0.25% accuracy and 0.9985 +/- 0.0002 AUC across three random seeds. SEABAD and the full curation pipeline are publicly released to support tropical BAD research and energy-efficient acoustic monitoring.
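The imbalance figure quoted in the abstract can be reproduced with a short calculation. The sketch below assumes the Gini coefficient is computed over per-class sample counts, which the abstract does not state; the function names are illustrative and are not taken from the SEABAD pipeline.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0.0 means perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending-sorted values.
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Hypothetical per-species clip counts, only to exercise the function.
print(gini([10, 10, 10, 10]))
print(gini([0, 0, 0, 40]))

# Relative change between the two coefficients reported in the abstract.
print(round(100 * relative_reduction(0.601, 0.519), 2))
```

With the rounded coefficients from the abstract, the relative reduction comes out at roughly 13.6%. The small difference from the reported 13.7% is presumably due to rounding of the published values.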

2605.20850 2026-05-21 cs.RO

SmoCap: Unified Scale-Pose Canonicalization with Proxy-Mapped Trust-Region QP

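Shihao Li, Naohiko Sugita

Affiliations: The Research into Artifacts, Center for Engineering, The University of Tokyo

AI summary: SmoCap jointly estimates morphology and posture in a sparse control subspace, addressing the morphology-posture compensation caused by stage-wise workflows. The result is a unified scale-pose canonicalization framework that makes motion canonicalization more practical.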

Comments 11 pages, 6 figures, 4 tables

Abstract

Objective: Stage-wise workflows that separate model scaling and inverse kinematics can induce morphology-posture compensation, resulting in anatomically inconsistent yet numerically acceptable solutions, especially in weakly observed directions. We present SmoCap, a leakage-resistant canonicalization framework that estimates morphology and posture jointly in each local trust-region quadratic program (QP) within a sparse control subspace. Methods: SmoCap solves a constrained trust-region QP with analytical proxy-mapped pose and scale Jacobians. The low-dimensional proxy map stabilizes weakly observed directions and drives coordinated structures. An optional pre-solve provides warm starts in difficult configurations. The framework is evaluated using cohort fluoroscopy knee motion, anthropometric ground truth, and extreme yoga sequences. Results: SmoCap achieved 2.9 degree knee flexion RMSE against fluoroscopy, and a pooled anthropometric endpoint error around 3%. In the leakage audit against segment-wise scaling, SmoCap also reduced marker RMSE, FE error, and anthropometric endpoint error. Proxy coupling preserved expressive and coordinated spine motion with a marginal increase in fitting error (+0.14 mm, +0.6%) against baseline models in the yoga ablation. Median marker RMSE was around 20 mm, and median runtime was 0.204-0.332 ms/frame, achieved consistently in 2-3 iterations. Conclusion: SmoCap provides an externally validated unified coupling-aware scale-pose framework, making externally consistent motion canonicalization practical at dataset scale.

2605.20839 2026-05-21 cs.CV cs.LG

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Jeffrey Wang, Jonathan Gregory, Grigorios G. Chrysos

Affiliations: University of Wisconsin-Madison

AI summary: This paper proposes activation-free polynomial alternatives for image recognition within MetaFormer-style vision models and reports that the polynomial modules perform strongly across multiple datasets.

Comments Accepted to ICML 2026

2605.20838 2026-05-21 cs.CV cs.AI

USV: Towards Understanding the User-generated Short-form Videos

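Haoyue Cheng, Su Xu, Liwei Jin, Wayne Wu, Chen Qian, Limin Wang

Affiliations: State Key Laboratory for Novel Software Technology

AI summary: This paper presents the USV dataset for high-level semantic video understanding. It defines topic recognition and video-text retrieval tasks on user-generated short-form videos and proposes two baselines, MMF-Net and VTCL.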

Abstract

Several large-scale video datasets have been published in recent years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has made notable progress in recent years, most works focus on instance-level recognition, which is not sufficient for learning representations of the high-level semantic information in videos. Therefore, we further establish two tasks on USV: topic recognition and video-text retrieval. We propose two unified and effective baseline methods, Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle topic recognition and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.

2605.20837 2026-05-21 cs.CV cs.AI

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

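Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang

Affiliations: School of Architecture, Tsinghua University

AI summary: This paper presents ArchSIBench, a benchmark for architectural spatial intelligence grounded in architecture, cognitive science, and psychology. With 17 fine-grained subtasks and 3,000 question-answer pairs, it evaluates a range of VLMs on architectural perception, reasoning, navigation, transformation, and configuration, and finds that most models still trail architecturally trained human evaluators on spatial transformation and configuration reasoning.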

Comments 51 pages

Abstract

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

2605.20834 2026-05-21 cs.AI cs.LG

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

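Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

Affiliations: The Hong Kong University of Science and Technology; Hong Kong Baptist University

AI summary: This paper examines the equivalence between DPO and RLHF and shows that it rests on an implicit assumption. When the assumption fails, DPO optimizes relative advantage rather than absolute alignment, which leads to pathological convergence. The authors propose CPO, which adds constraints to achieve provable alignment, and give a geometric interpretation that reveals DPO's margin-ranking mechanism.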

Comments 49 pages

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
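The failure mode described in the abstract can be seen directly in the standard DPO loss, which depends only on how far the policy moves relative to the reference. The sketch below is a minimal illustration with made-up log-probabilities. It is not the authors' code and does not implement CPO.

```python
import math


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    pi_w / pi_l: policy log-probabilities of the preferred / dispreferred response.
    ref_w / ref_l: the same quantities under the frozen reference policy.
    """
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Illustrative numbers (assumed): the reference strongly prefers the
# dispreferred response.
ref_w, ref_l = -10.0, -2.0

loss_at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# The policy raises the preferred response and lowers the dispreferred one
# relative to the reference, so the loss goes down.
pi_w, pi_l = -8.0, -3.0
loss_after_update = dpo_loss(pi_w, pi_l, ref_w, ref_l)

print(loss_after_update < loss_at_reference)  # loss decreased
print(pi_l > pi_w)  # yet the policy still ranks the dispreferred response higher
```

Both checks print `True`: the loss drops because the relative margin is positive, even though the policy's absolute ranking still favors the dispreferred response.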

2605.20833 2026-05-21 cs.CL

MemGym: a Long-Horizon Memory Environment for LLM Agents

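Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

Affiliations: Rutgers University; Capital One; Princeton University; Microsoft Research

AI summary: This paper presents MemGym, a benchmark environment for evaluating the memory of LLM agents. It unifies existing agent gyms and in-house memory-grounded pipelines behind a single memory-reasoning interface. MemGym comprises five evaluation tracks across four agentic regimes and scores memory in isolation, free of confounding from reasoning, retrieval, and tool-use ability.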

Abstract

Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

2605.20827 2026-05-21 cs.CV

HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction

Yaoyao Yue, Jérôme Schmid, Xiaoshuang Li, Eduardo Delamare, Jinman Kim

Affiliations: School of Computer Science, The University of Sydney; Geneva School of Health Sciences, HES-SO University of Applied Sciences and Arts Western Switzerland; Department of Computer Science and Engineering, Shanghai Jiao Tong University; Sydney Dental School, Faculty of Medicine and Health, The University of Sydney

AI summary: This paper proposes HyDAR-Pano3D, which reduces the ambiguity of panoramic-radiograph-to-CBCT reconstruction by recasting it as a disentangled anatomical recovery problem. Experiments show it outperforms baseline methods on PSNR, SSIM, and Dice, and recovers clinically relevant anatomical structures.

Comments 10 pages

Abstract

Panoramic radiograph (PR) is fundamentally used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods (p < 0.05), achieving a 25.76 dB PSNR, 85.70% SSIM, and an 83.83% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4% Dice) and the inferior alveolar canal (72.2% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.
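For readers unfamiliar with two of the metrics reported above, the following sketch shows the usual definitions of the Dice score and PSNR on flattened voxel lists. It is a generic illustration using textbook formulas, not the paper's evaluation code; the inputs and the `data_range` default are assumptions.

```python
import math


def dice(pred, target):
    """Dice overlap between two binary masks given as flat sequences."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, data_range]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(data_range ** 2 / mse)


# Toy inputs: one of the two foreground voxels in each mask overlaps.
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
print(psnr([0.0, 0.5], [0.0, 0.4]))
```

Dice rewards overlap of segmented structures, while PSNR measures voxel-wise intensity fidelity, which is why a reconstruction paper typically reports both.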

2605.20824 2026-05-21 cs.LG

Markovian Circuit Tracing for Transformer State Dynamic

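Abdullah X

Affiliations: Project AWARE and Zephara AI

AI summary: This study introduces Markovian Circuit Tracing (MCT), a method for assessing whether transformer activations contain coarse state-transition structure. Using synthetic Hidden Markov Model tasks, it finds that residual activations carry partial Bayesian belief information and shows how well state abstractions recover coarse transition signal across regimes.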

Abstract

Many sequence computations are easier to study as movement through internal states than as isolated local circuits. We introduce Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer activations contain coarse state-transition structure. The benchmark uses synthetic Hidden Markov Model (HMM) tasks where latent states, transition matrices, Bayesian belief vectors, Bayes-optimal predictions, and forced-state counterfactual targets are known exactly. Across six HMM families and three seeds per family, tiny causal transformers learn near-Bayes next-token predictors, with mean excess loss over Bayes of 0.0138. Residual activations contain partial Bayesian belief information in this controlled synthetic benchmark. State abstractions extracted from these activations recover coarse transition signal, strongest in persistent and lower-state regimes, and weaker in ambiguous-emission and six-state regimes. The clearest result comes from state forcing. Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and evaluation framework for transformer state-dynamics interpretability, with MCT as a simple reference pipeline.
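The "Bayesian belief vectors" and "Bayes-optimal predictions" that serve as ground truth in such a benchmark come from the standard HMM forward recursion. The sketch below shows that recursion on a made-up two-state HMM. The matrices are illustrative assumptions, and this is not the MCT pipeline itself.

```python
def predict_state(belief, transition):
    """Push a belief over the current state one step through the transition matrix."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def belief_update(belief, transition, emission, obs):
    """One forward-filter step: transition, then condition on the observed token."""
    predicted = predict_state(belief, transition)
    unnormalized = [p * emission[j][obs] for j, p in enumerate(predicted)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def next_token_probs(belief, transition, emission):
    """Bayes-optimal next-token distribution given the current belief."""
    predicted = predict_state(belief, transition)
    n_tokens = len(emission[0])
    return [sum(p * emission[j][k] for j, p in enumerate(predicted))
            for k in range(n_tokens)]


# Hypothetical "persistent" two-state HMM with two tokens.
transition = [[0.9, 0.1], [0.1, 0.9]]   # transition[i][j] = P(next state j | state i)
emission = [[0.8, 0.2], [0.2, 0.8]]     # emission[j][k] = P(token k | state j)

belief = belief_update([0.5, 0.5], transition, emission, obs=0)
print(belief)
print(next_token_probs(belief, transition, emission))
```

A probe that decodes the belief vector from residual activations, or a comparison of model logits against `next_token_probs`, would be measured against exactly these quantities.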

2605.20822 2026-05-21 cs.CV

TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection

Jiae Yoon, Ue-Hwan Kim

Affiliations: Department of AI Convergence, Gwangju Institute of Science and Technology (GIST)

AI summary: This paper proposes TERDNet, a Transformer Encoder-Recurrent Decoder Network for scene change detection. Multi-level feature extraction, a feature fusion module, a recurrent decoder, and an upsampling module together improve detection accuracy and robustness.

Comments 8 pages, 4 figures. Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at https://github.com/AutoCompSysLab/TERDNet.

2605.20821 2026-05-21 cs.CV cs.RO

VSCD: Video-based Scene Change Detection in Unaligned Scenes

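Jiae Yoon, Ue-Hwan Kim

Affiliations: Department of AI Convergence, Gwangju Institute of Science and Technology; GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology

AI summary: This study proposes VSCD, a video-based change detection method for unaligned scenes that produces a pixel-wise change mask for each query frame. A multi-reference model aligns reference features through local patch correspondence and fuses candidate change features to produce high-resolution masks, outperforming existing image- and video-based baselines.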

Comments 18 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.

2605.20820 2026-05-21 cs.CV

AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting

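Zhaojie Zeng, Yuesong Wang, Yawei Luo, Tao Guan

Affiliations: School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Technology, Zhejiang University

AI summary: This paper proposes AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, removing per-image test-time optimization. A stage-wise residual architecture progressively predicts additional Gaussian primitives from reconstruction residuals, and an explicit stage control mechanism activates new primitives only in under-reconstructed regions. A Predict-Optimize-Distill training strategy stabilizes multi-stage prediction, yielding more efficient image reconstruction.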

Comments preprint version

Abstract

2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict-Optimize-Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160-300 ms. Code: https://github.com/whoiszzj/AIR.git

2605.20818 2026-05-21 cs.CV

OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

Yisen Feng, Leigang Qu, Haoyu Zhang, Qiaohui Chu, Meng Liu, Xuemeng Song, Weili Guan, Liqiang Nie

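Affiliations: Harbin Institute of Technology (Shenzhen); National University of Singapore; Pengcheng Laboratory; Shandong Jianzhu University; Southern University of Science and Technology

AI summary: This paper proposes a reranking framework built on a multimodal large language model (MLLM) for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge 2026. It combines candidate segments from the existing grounding model OSGNet with the MLLM's video-language reasoning to localize temporal segments more accurately.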

Comments Champion solution for the Natural Language Queries and GoalStep tracks of the Ego4D Challenge at the CVPR EgoVis Workshop 2026

2605.20815 2026-05-21 cs.CL cs.AI cs.IR cs.LG

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

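Peter Fernandes, Ria Kanjilal

Affiliations: Department of Computer Engineering; California Polytechnic State University

AI summary: This paper studies GraphRAG for healthcare EHR schema retrieval using local LLMs on consumer hardware. It evaluates four models on indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination, and finds that model parameter count and retrieval mode significantly affect the results.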

Comments 9 pages, 1 figure, 5 tables

Abstract

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

2605.20813 2026-05-21 cs.CL

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

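Yanyi Lyu, Letian Chen, Futing Sun, Miao Zhang, Weili Guan, Liqiang Nie

Affiliations: Harbin Institute of Technology (Shenzhen)

AI summary: This paper proposes PulseCol, a periodically refreshed column-sparse attention method. Its finer-grained sparsification improves the computational efficiency and speedup of diffusion language models while maintaining model quality.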

Abstract

Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95x end-to-end speedup over FlashAttention across several context lengths.
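The general idea of column-sparse attention with a reused, periodically refreshed pattern can be sketched in a few lines. Everything below is an illustrative assumption: the column-selection criterion (total attention mass per key), the fixed refresh period, and all function names are hypothetical stand-ins, since the abstract does not specify how PulseCol chooses or schedules its patterns.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score(q, k):
    """Scaled dot-product score between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def select_columns(Q, K, keep):
    """Keep the `keep` key columns that receive the most total attention
    (a hypothetical criterion used only for this sketch)."""
    mass = [0.0] * len(K)
    for q in Q:
        for j, p in enumerate(softmax([score(q, k) for k in K])):
            mass[j] += p
    ranked = sorted(range(len(K)), key=lambda j: mass[j], reverse=True)
    return sorted(ranked[:keep])


def column_sparse_attention(Q, K, V, cols):
    """Attention restricted to the selected key/value columns."""
    out = []
    for q in Q:
        probs = softmax([score(q, K[j]) for j in cols])
        out.append([sum(p * V[j][d] for p, j in zip(probs, cols))
                    for d in range(len(V[0]))])
    return out


def refresh_steps(num_steps, period):
    """Denoising steps at which the column pattern is recomputed
    (a fixed period is assumed here)."""
    return [t for t in range(num_steps) if t % period == 0]


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

cols = select_columns(Q, K, keep=1)          # computed once, then reused
print(cols)
print(column_sparse_attention(Q, K, V, cols))
print(refresh_steps(8, 4))
```

In a denoising loop, `select_columns` would run only at the steps returned by `refresh_steps`, and every other step would call `column_sparse_attention` with the cached `cols`. The speedups reported in the abstract rely on optimized GPU kernels, which this pure-Python sketch does not attempt to model.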

2605.20811 2026-05-21 cs.RO

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

Jingyang He, Guangrun Li, Jieyu Zhang, Chengkai Hou, Zhengping Che, Shanghang Zhang

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; University of Washington; Beijing Innovation Center of Humanoid Robotics

AI summary: This paper proposes Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. It translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space, and the target agent reaches these subgoals through planning, enabling flexible imitation across heterogeneous embodiments.

Abstract

Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

2605.20809 2026-05-21 cs.CL

Refining and Reusing Annotation Guidelines for LLM Annotation

Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa

Affiliations: The Graduate University for Advanced Studies, SOKENDAI; National Institute of Informatics (NII); BioData Science Initiative (BSI), National Institute of Genetics (NIG)

AI summary: This paper proposes a systematic approach to reusing and refining annotation guidelines for LLM annotation through an iterative audit framework. On biomedical NER tasks it validates the effectiveness of guideline integration, the advantage of reasoning-optimized models, and the feasibility of auditing under minimal supervision.

Comments 14 pages, 7 figures. Accepted to the ACL 2026 Main Conference

2605.20808 2026-05-21 cs.CV

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

Jinjin Zhang, Xiefan Guo, Di Huang

Affiliations: Beihang University

AI summary: This paper proposes Spatial Gram Alignment (SGA), which leverages the representation priors of vision foundation models while preserving the generative capacity of LDMs. It targets the conflict between generation quality and structural integrity in ultra-high-resolution image synthesis and enables high-quality text-to-image synthesis.

Comments Technical Report

Abstract

Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
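
The abstract describes aligning the internal self-similarities of generative features with those of foundation-model priors, but it does not state the loss. As a rough illustration only, the sketch below assumes cosine-similarity Gram matrices over spatial positions and a mean-squared difference between them. The names `gram` and `gram_alignment_loss` are hypothetical and are not taken from the released SGA code.

```python
import math


def gram(features):
    """Cosine self-similarity (Gram) matrix of N feature vectors.

    `features` is a list of N vectors, one per spatial position.
    """
    normed = []
    for vec in features:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard zero vectors
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]


def gram_alignment_loss(gen_feats, prior_feats):
    """Mean-squared difference between the two N x N Gram matrices.

    Assumed form of the alignment term; the paper's loss may differ.
    """
    if len(gen_feats) != len(prior_feats):
        raise ValueError("both feature sets need the same number of positions")
    g, p = gram(gen_feats), gram(prior_feats)
    n = len(g)
    return sum((g[i][j] - p[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Because each Gram matrix is N x N, the two feature sets may have different channel widths. Only the pairwise similarity structure is compared, which is one way to read the "non-invasive" constraint, in contrast to direct patch-wise feature matching.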

2605.20807 2026-05-21 cs.CV

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

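Hanzhong Guo, Yizhou Yu

Affiliations: School of Computing and Data Science, The University of Hong Kong

AI summary: The study proposes a two-stage framework that first predicts a Canny map and then generates the final image conditioned on the source appearance and the predicted structure, in order to preserve high-frequency identity details such as logos, patterns, and text in subject-driven text-to-image generation. It builds a text-aware dataset of 100,000 pairs with an automatic pipeline, and experiments indicate that intermediate structural prediction improves high-fidelity subject-driven generation.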

2605.20804 2026-05-21 cs.CV cs.LG

OlmoEarth v1.1: A more efficient family of OlmoEarth models

Gabriel Tseng, Yawen Zhang, Favyen Bastani, Henry Herzog, Joseph Redmon, Hadrien Sablon, Piper Wolters, Patrick Alan Johnson, Christopher Wilhelm, Patrick Beukema

Affiliations: Allen Institute for AI

AI summary: This paper presents an improved family of OlmoEarth models that substantially reduces compute cost by optimizing training and inference while maintaining overall model performance.

2605.20803 2026-05-21 cs.LG cs.AI

Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning

Kei Hiroshima, Kento Uchida, Shinichi Shirakawa

Affiliations: Yokohama National University

AI summary: This paper proposes Tunable MAGMAX, a model merging framework that introduces a preference vector to control task-specific performance, so that merged models can be adapted to different deployment environments and user preferences in continual learning.

Comments 17 pages, 4 figures. Accepted at ICPR 2026

Abstract

Continual learning (CL) aims to train models sequentially on multiple tasks while mitigating catastrophic forgetting of previously learned knowledge. Recent advances in large pre-trained models (LPMs) and model merging techniques, such as MAGMAX, have demonstrated effective CL performance by combining task-specific parameters. However, existing methods primarily focus on average performance across all tasks and do not adequately address how to construct models accommodating different deployment environments or varying user preferences. This paper proposes a model merging framework, termed Tunable MAGMAX, which enables preference-aware control of task-specific performance in CL. Our method introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing us to adjust the merged model performance according to their deployment needs. We further propose a method for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification. The experimental result on CL benchmark tasks demonstrates that Tunable MAGMAX effectively controls task-wise performance and successfully adapts merged models to various target environments. The proposed Tunable MAGMAX achieves superior or comparable performance to baseline methods, making it a practical solution for deploying CL models to various environments where the preferences of each task performance differ.
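
The abstract says a preference vector controls how many elements are selected from each task vector during merging, but it does not give the selection rule. The sketch below is one hypothetical reading: start from maximum-magnitude selection across task vectors (the idea behind MAGMAX-style merging) and weight each task's magnitudes by its preference before taking the maximum. The function `merge_with_preferences`, the weighting rule, and the scaling factor `lam` are all assumptions for illustration.

```python
def merge_with_preferences(theta0, task_vectors, prefs, lam=1.0):
    """Merge task vectors into pretrained weights `theta0`.

    For each parameter index, keep the task-vector element whose
    preference-weighted magnitude is largest (hypothetical rule).
    Returns the merged weights and how many elements each task contributed.
    """
    counts = [0] * len(task_vectors)
    merged = []
    for i, base in enumerate(theta0):
        winner = max(
            range(len(task_vectors)),
            key=lambda t: prefs[t] * abs(task_vectors[t][i]),
        )
        counts[winner] += 1
        merged.append(base + lam * task_vectors[winner][i])
    return merged, counts


theta0 = [0.0, 0.0, 0.0]
task_vectors = [[1.0, -0.2, 0.5], [0.5, 0.4, -0.6]]
print(merge_with_preferences(theta0, task_vectors, prefs=[1.0, 1.0]))
print(merge_with_preferences(theta0, task_vectors, prefs=[3.0, 1.0]))
```

With equal preferences the second task supplies two of the three elements. Raising the first task's preference shifts every selection to the first task, which is the kind of task-wise control the abstract describes.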

2605.20801 2026-05-21 cs.RO quant-ph

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Mohamed Khair Altrabulsi, Nouhaila Innan, Alberto Marchisio, Muhammad Kashif, Muhammad Shafique

Affiliations: eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD); Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute

AI summary: This paper proposes Q-SpiRL, a framework built around quantum-enhanced spiking neural networks for efficient and stable robot navigation in dynamic environments. Experiments indicate that it achieves the best balance among task completion, trajectory efficiency, and motion smoothness.

Comments 11 pages, 6 figures

Abstract

Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.
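
The abstract lists success-weighted path length among its metrics without defining it. A common definition in the navigation literature averages, over episodes, the success indicator times the ratio of the shortest-path length to the larger of the actual and shortest path lengths. The sketch below assumes that definition; Q-SpiRL's exact formula may differ, and the function name `spl` is only illustrative.

```python
def spl(episodes):
    """Success-weighted path length under the commonly used definition.

    `episodes` is a list of (success, shortest_len, actual_len) tuples.
    Failed episodes contribute zero; successful ones contribute
    shortest_len / max(actual_len, shortest_len).
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, actual in episodes:
        if success:
            total += shortest / max(actual, shortest)
    return total / len(episodes)


# One optimal success, one success with a path twice as long, one failure.
print(spl([(True, 10, 10), (True, 10, 20), (False, 10, 15)]))
```

Under this definition, an agent that always reaches the goal but takes long detours scores lower than its raw success rate, which is why the metric is usually reported next to success rate rather than in place of it.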

2605.20798 2026-05-21 cs.LG cs.CL

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

Yang Zhao, Jiahao Lu, Bin Huang, Guhua Zhang, Jie Zhou

Affiliations: Tencent

AI summary: This paper finds that most Transformer modifications still do not transfer at the 1-3B parameter scale, based on tests under strict iso-data, iso-compute, iso-recipe control combined with downstream evaluation and a multi-seed noise floor.

Comments 19 pages, 3 figures, under review at EMNLP 2026

Abstract

Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.
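
To make the multiple-comparison step concrete: with a Bonferroni correction over m tests at family-wise level alpha, a result counts as significant only if its p-value is below alpha / m, so 20 comparisons at alpha = 0.05 require p < 0.0025. The sketch below shows that filter. The p-values are made-up placeholders, not results from the paper, and the abstract does not say which alpha or test statistic the authors used.

```python
def bonferroni_pass(p_values, alpha=0.05):
    """Return the entries whose p-value clears the Bonferroni threshold.

    `p_values` maps a modification name to its p-value against the
    baseline; the threshold is alpha divided by the number of tests.
    """
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p for name, p in p_values.items() if p < threshold}


# Placeholder p-values for 20 hypothetical modifications.
example = {f"mod_{i}": 0.01 for i in range(20)}
example["mod_0"] = 0.001
print(bonferroni_pass(example))
```

In this placeholder set, p = 0.01 would pass an uncorrected 0.05 test but fails the corrected threshold, so only `mod_0` survives.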

2605.20797 2026-05-21 cs.LG

Beyond Numerical Features: CNN-Driven Algorithm Selection via Contour Plots for Continuous Black-Box Optimization

Yiliang Yuan, Xiang Shi, Mustafa Misir

Affiliations: Mohamed bin Zayed University of Artificial Intelligence, Masdar City, United Arab Emirates

AI summary: This paper proposes a representation-driven approach to per-instance algorithm selection for black-box optimization, automatically choosing the most promising solver from a fixed portfolio. As a complement to numerical descriptors such as Exploratory Landscape Analysis features and learned embeddings like Deep-ELA, it uses contour-map visualizations of probed landscapes: a CNN regressor takes multiple instance-specific contour views and predicts per-solver performance. On the standard BBOB 2009 single-objective protocol the resulting selectors significantly outperform the single best solver (SBS) and are competitive with feature-based baselines, and a bi-objective evaluation under the DeepELA setting indicates that the same principle remains competitive with windowed contour views.

Abstract

The present paper introduces a new representation-driven approach to per-instance algorithm selection, applied to black-box optimization, for automatically choosing the most promising solver from a fixed portfolio. Prior work in continuous optimization largely relies on numerical descriptors, including Exploratory Landscape Analysis features and learned embeddings such as Deep-ELA. This work studies a complementary representation: contour-map visualizations of probed landscapes. A CNN regressor takes multiple instance-specific contour views (stacked or encoded per view and aggregated) and predicts per-solver performance, enabling selection by the predicted best value. On the standard BBOB 2009 single-objective protocol, the resulting selectors significantly outperform the single best solver (SBS) and are competitive with feature-based baselines. A subsequent bi-objective evaluation under the DeepELA setting further indicates that the same image-based principle can be competitive when using windowed contour views. Overall, the results suggest that simple vision models can exploit spatial structure in probed landscapes for algorithm selection without handcrafted ELA features.

2605.20796 2026-05-21 cs.RO

CMC-Opt: Constraint Manifold with Corners for Inequality-Constrained Optimization

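Yetong Zhang, Frank Dellaert

Affiliations: College of Computing, Georgia Institute of Technology, Atlanta, USA

AI summary: This paper proposes a manifold-based framework for optimization problems in robotics that have both equality and inequality constraints. By introducing constraint manifolds with corners, it converts the original problem directly into an unconstrained optimization over the constrained state space, and it validates the framework's effectiveness and robustness on large-scale dynamics planning problems.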

2605.20793 2026-05-21 cs.CL

Assessing socio-economic climate impacts from text data

Mariana Madruga de Brito, Brielen Madureira, Taís Maria Nunes Carvalho, Damien Delforge, Aglaé Jézéquel, Murathan Kurfalı, Ni Li, Gabriele Messori, Joakim Nivre, Barbara Pernici, Niko Speybroeck, Stefano Terzi, Wim Thiery, Bram Valkenborg, Jingxian Wang, Shorouq Zahra, Jakob Zscheischler, Jan Sodoge

Affiliations: Helmholtz Centre for Environmental Research, Germany; Leipzig University, Germany; UCLouvain Brussels, Belgium; Ecole des Ponts, France; Université PSL, École Polytechnique, Institut Polytechnique de Paris, Sorbonne Université, CNRS, France; RISE Research Institutes of Sweden, Sweden; Swedish Centre for Impacts of Climate Extremes (Climes), Sweden; Vrije Universiteit Brussel, Belgium; TUD Dresden University of Technology, Germany; Uppsala University, Sweden; Politecnico di Milano, Italy; Istituto Universitario di Studi Superiori IUSS Pavia, Italy; Center for Climate Change and Transformation, Eurac Research, Italy; Royal Museum for Central Africa, Belgium; Technopolis Group, Germany

AI summary: This paper presents a systematic approach to analyzing the socio-economic impacts of climate hazards from text data. It addresses the fragmentation of existing research and the lack of clear guidelines by summarizing common practices and challenges and by proposing recommendations for building more reliable text-derived socio-economic impact datasets.

Comments Work in progress

Abstract

Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.

2605.20786 2026-05-21 cs.CL

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

Wajdi Zaghouani

Affiliations: Northwestern University in Qatar

AI summary: This paper distills twenty years of experience building Arabic NLP resources and research infrastructure. It argues that building datasets is as much a social process as a technical one, that communities formed around shared tasks often matter more than the tasks themselves, and that moving from language resources to computational social science exposes challenges that traditional NLP training does not address.

Comments Accepted at the ACL 2026 Workshop: The Big Picture 2026: Crafting a Research Narrative v2

Abstract

This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.

2605.20784 2026-05-21 cs.AI cs.LG

Interaction Locality in Hierarchical Recursive Reasoning

Yosuke Miyanishi, Tetsuro Morimura

Affiliations: CyberAgent Inc.

AI summary: This paper proposes interaction locality, a framework for measuring whether information flow stays within nearby cells or semantic segments or crosses them. Applied to hierarchical recursive reasoning models such as HRM and TRM, it yields a reproducible measurement framework for local execution versus global planning.

Abstract

Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.

2605.20782 2026-05-21 cs.LG

Causal Machine Learning Is Not a Panacea: A Roadmap for Observational Causal Inference in Health

Donna Tjandra, Trenton Chang, Sonali Parbhoo, Rajesh Ranganath, Andre Kurepa Waschka, William Mitchell, Maggie Makar, Shalmali Joshi, Finale Doshi-Velez, Leo Anthony Celi, Jenna Wiens

Affiliations: Division of Computer Science and Engineering, University of Michigan; Department of Electrical and Electronic Engineering, Imperial College London; Courant Institute of Mathematical Sciences, New York University; Center for Data Science, New York University; Department of Mathematics & Statistics, Elon University; Department of Ophthalmology, Cambridge University Hospitals; Department of Biomedical Informatics, Columbia University; School of Engineering and Applied Science, Harvard University; Laboratory for Computational Physiology, Institute for Medical Engineering and Science, Massachusetts Institute of Technology; Department of Medicine, Beth Israel Deaconess Medical Center; Department of Biostatistics, Harvard T.H. Chan School of Public Health

AI summary: This paper examines the use of causal machine learning on observational data, stresses the importance of verifying validity assumptions and applying causal ML responsibly, and offers a template for strengthening the rigor and interpretability of causal analyses.

Abstract

Objective: The growing availability of large-scale observational clinical datasets and challenges in conducting randomized controlled trials have spurred enthusiasm in using causal machine learning (ML) for causal inference in observational data. We present a roadmap for applying causal ML to observational data. Materials and methods: We outline the importance of assessing validity assumptions within available data and applying causal ML responsibly for clinical experts using causal ML and ML practitioners with limited clinical expertise. Observations: Despite advances in causal ML, its limitations remain largely under-appreciated across disciplines. This gap in shared knowledge may impact the validity of findings. Discussion: Causal assumptions must be satisfied and modeling choices justified. Otherwise, these approaches risk producing biased or misleading results, with consequences for clinical research and patient care. Conclusion: Causal ML can be a powerful tool for generating causal hypotheses. We provide a template to strengthen the rigor and interpretability of causal analyses.

2605.20780 2026-05-21 cs.LG cs.CV

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

Haozhe Jia, Pengyu Yin, Wenshuo Chen, Shaofeng Liang, Lei Wang, Bowen Tian, Xiucheng Wang, Nanqian Jia, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shandong University; LimX Dynamics Technology Co., Ltd.; Xidian University; Peking University; Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI); Griffith University

AI summary: The study proposes REPA-P, a teacher-free framework that aligns intermediate features with physical states using first-principles residuals. It addresses the tendency of intermediate representations in physics-informed diffusion models to fall into shortcut learning when boundary conditions shift, and across four PDE tasks it accelerates convergence, reduces physics residuals, and improves out-of-distribution robustness.

Abstract

Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce REPA-P, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing zero overhead. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at https://github.com/Hxxxz0/REPA-P.
