arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17433 2026-06-17 cs.CV New submission

LADBench: A Benchmark for Logical Fault Detection in Images

Sahasra Kondapalli, Lara Radovanovic, Aadi Palnitkar, Mingyang Mao, Xiaomin Lin

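Affiliations: University of South Florida

AI summary: Introduces the LAD-Bench benchmark, with more than 1,000 synthetic images containing logical anomalies across four domains; models are evaluated with a tiered prompting protocol, which exposes the shortcomings of current VLMs at implicit logical fault detection.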

Comments Accepted to the IEEE International Conference on Development and Learning (ICDL 2026)

Abstract

Large Vision Language Models (VLMs) excel at visual question answering and semantic grounding, but their capacity for autonomous logical reasoning remains underexplored. Existing anomaly benchmarks emphasize visual errors or direct prompting rather than the physical and social common sense needed for open-world deployment. To address this, we introduce LAD-Bench, a benchmark of more than 1,000 curated synthetic images with logical anomalies across four domains: Residential, Urban, Collaborative, and Nature. We further propose a Tiered Prompting Protocol based on progressive disclosure, which measures how much explicit assistance a model needs to localize and reason about a logical fault. Evaluating leading foundation models reveals substantial weaknesses: even the best achieves only 70.11% overall accuracy, showing that implicit logical fault detection remains unsolved. Crucially, models often fail to identify anomalies even after receiving explicit hints in deeper tiers. By surfacing these limitations in sequential multimodal reasoning, LAD-Bench offers a rigorous framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems. Dataset and Code: https://huggingface.co/datasets/SahasraK/LADBench
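A progressive-disclosure protocol of this kind can be pictured as a loop that reveals increasingly explicit hints and records the first tier at which the model names the anomaly. The sketch below is illustrative only: the tier wording in `TIERS`, the `ask_model` callable, and the keyword-match scoring rule are assumptions, not details taken from the paper.

```python
from typing import Callable, Optional, Sequence

# Hypothetical tier prompts, ordered from least to most explicit help.
TIERS: Sequence[str] = (
    "Describe this image.",
    "Is anything illogical in this image?",
    "Something is wrong in the {region}. What is it?",
)

def first_successful_tier(
    ask_model: Callable[[str], str],
    region: str,
    expected_keyword: str,
) -> Optional[int]:
    """Return the 1-based tier at which the answer mentions the anomaly, else None."""
    for tier, template in enumerate(TIERS, start=1):
        # Tiers without a {region} placeholder simply ignore the argument.
        answer = ask_model(template.format(region=region))
        if expected_keyword.lower() in answer.lower():
            return tier
    return None
```

A lower returned tier means the model needed less explicit help; `None` corresponds to the failure case the abstract highlights, where even explicit hints do not surface the anomaly.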

2606.17431 2026-06-17 cs.CV New submission

Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

Quoc-Duy Tran, Anh-Tuan Vo, Trung-Nghia Le

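Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh

AI summary: Proposes the Visual Retrieval-Augmented Generation (Visual-RAG) framework, which retrieves animal shapes structurally similar to a natural silhouette and uses ControlNet and IP-Adapter to guide a diffusion model in generating animal art, a form of computational pareidolia.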

Comments SOICT 2025

Abstract

Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.

2606.17430 2026-06-17 cs.CV New submission

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

Trinh Thi Thu Hien, Trung-Nghia Le

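Affiliations: University of Science, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City

AI summary: Proposes CIAN, a multi-stage framework that retrieves relevant articles, generates narratives with a LoRA-fine-tuned Qwen model, and applies N-gram refinement; on OpenEvents-V1 it raises CIDEr from 0.030 to 0.094 for event-enriched image captioning.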

Comments SOICT 2025

Abstract

Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches captions with external narratives. CIAN retrieves relevant articles using SigLIP, summarizes them to guide a Narrative Generation stage with a LoRA-fine-tuned Qwen model, and applies N-Gram-based Refinement for fluency and coherence. On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and improves caption quality, increasing CIDEr from 0.030 to 0.094. These results highlight the effectiveness of retrieval-augmented reasoning combined with linguistic refinement for generating context-aware, human-like captions.
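The retrieval stage and its mAP score can be illustrated with a small sketch: articles are ranked by cosine similarity to an image embedding, and average precision is computed per query (mAP is the mean over queries). The embeddings are assumed to be precomputed by some encoder such as SigLIP; the function names and the plain-list representation are assumptions made for illustration, not the authors' code.

```python
import math
from typing import List, Sequence, Set

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_articles(query: Sequence[float], articles: Sequence[Sequence[float]]) -> List[int]:
    """Indices of articles sorted by decreasing similarity to the query embedding."""
    return sorted(range(len(articles)), key=lambda i: cosine(query, articles[i]), reverse=True)

def average_precision(ranking: Sequence[int], relevant: Set[int]) -> float:
    """Average of precision@k taken at each rank k where a relevant article appears."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

With one relevant article per image, average precision reduces to the reciprocal rank of that article, so an mAP near 1 means the correct article is almost always ranked first.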

2606.17427 2026-06-17 cs.CV cs.HC New submission

Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications

Damian M. Manzone, Mathew Szymanowski, Olga Taran, Shuo Cai, Melissa Marquez-Chin, Tammy Zeng, Hardeep Singh, Cesar Marquez-Chin, José Zariffa

Affiliations: KITE Research Institute, Toronto Rehabilitation Institute, University Health Network; Institute of Biomedical Engineering, University of Toronto; Department of Health Sciences and Technology, ETH Zürich; Department of Occupational Science & Occupational Therapy and the Rehabilitation Sciences Institute, University of Toronto

AI summary: Evaluates the accuracy of the HoloLens 2 and several pose estimation algorithms under hand impairment and object occlusion; the predictions generalize to people with hand impairment, and clear objects give a slight accuracy advantage.

Abstract

Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head-mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups, suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm), and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and that the dataset generated can be used to refine pose estimation methods for hand-impaired populations.
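Accuracy differences reported in millimetres are typically obtained by comparing predicted 3D joints with the triangulated ground truth. The abstract does not name the exact metric, so the sketch below assumes a plain mean per-joint Euclidean error; the function name and the tuple representation are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]

def mean_joint_error(pred: Sequence[Point3D], truth: Sequence[Point3D]) -> float:
    """Mean Euclidean distance between matching joints, in the units of the input (e.g. mm)."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)
```

Averaging this value per group (cSCI vs. control) or per object type (clear vs. opaque) would give the kind of millimetre-level comparison the abstract describes.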

2606.17418 2026-06-17 cs.RO New submission

DexLink Hand: A Compact, Affordable, 16-DOF Linkage-Driven Hand with Human-Like Dexterity

Hao Wu, Yanzhe Wang, Yu Feng, Jian Liu, Jihao Li, Jianshu Zhou, Huixu Dong

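Affiliations: Zhejiang University; National University of Singapore

AI summary: Presents a compact, low-cost linkage-driven anthropomorphic hand whose hybrid planar and spatial linkage mechanisms provide 20 joints driven by 16 independent actuators; it weighs 320 g, costs under USD 400, achieves the maximum Kapandji score, and reproduces all 33 Feix grasp types.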

Abstract

Dexterous robotic hands face a longstanding trade-off among dexterity, compactness, and affordability. Particularly, high-degree-of-freedom designs typically demand complex actuation and transmission, hindering integration into human-scale forms. To address these challenges, this work presents a compact, low-cost linkage-driven anthropomorphic hand that achieves high dexterity, structural integration, and human-hand-like functionality. The hand integrates 20 joints driven by 16 independent actuators, with all actuation, sensing, and transmission components compactly embedded within a human-hand-sized structure. The resulting prototype weighs only 320g at a total cost below USD 400. To meet these objectives, a hybrid mechanical architecture combining planar and spatial linkage mechanisms is proposed, enabling decoupled multidirectional motion, biomimetic joint synergies, and high passive load-bearing capability. The thumb further incorporates biomimetic features supporting human-like reconfiguration and opposition movements. Through the coordinated integration of these mechanisms and structural layout, the prototype achieves a highly integrated design with anthropomorphic dexterity. Experimental evaluations demonstrate that the hand achieves the maximum Kapandji score, reproduces all 33 Feix grasp types, and performs stable grasping and dexterous manipulation across a wide variety of daily objects and tools. These results validate the proposed hand as an affordable, compact, and mechanically efficient platform for dexterous manipulation, teleoperation, and robot learning in human-centered environments.

2606.17417 2026-06-17 cs.SD cs.LG New submission

A Closer Look at Failure Modes in Temporal Understanding of Large Audio-Language Models

Apoorva Kulkarni, Kaousheik Jayakumar, Sreyan Ghosh, Sarah Wiegreffe, Dinesh Manocha, Ramani Duraiswami

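Affiliations: University of Maryland, College Park

AI summary: Through behavioral and causal mechanistic analysis, shows that large audio-language models under-use audio when textual cues are available, that modality imbalance alone does not explain their temporal reasoning failures, and that inference-time attention interventions improve accuracy without fine-tuning.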

Comments Accepted to Interspeech 2026

Abstract

Large Audio Language Models (LALMs) achieve strong performance on a variety of audio understanding tasks but continue to struggle with temporal reasoning, a fundamental capability central to human auditory perception. Understanding the causes of these failures remains challenging as existing benchmarks report performance gaps without probing underlying mechanisms. To address this, we introduce a benchmark with 1,657 questions across three foundational tasks designed specifically for mechanistic analysis. Examining model outputs across varying input settings (behavioral analysis) reveals that models often under-utilize audio when textual cues are available. We also provide the first causal mechanistic analysis of temporal reasoning failures in LALMs. Comparing attention upweighting against scaling, we find that redistributing attention across audio tokens is more effective than increasing audio attention. Targeting task-relevant tokens yields further gains. These findings suggest that modality imbalance alone cannot explain failures. Attention scaling at bottleneck layers improves accuracy from 55.9% to 59.1% without fine-tuning, demonstrating a promising direction for future work.
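The contrast between giving audio more attention overall and redistributing attention among audio tokens can be made concrete on a single normalized attention row. The sketch below is one plausible reading, not the paper's procedure: both function names, the `alpha` factor, and the `relevance` scores are assumptions, and no mapping onto the paper's terms "upweighting" and "scaling" is implied.

```python
from typing import List, Sequence

def increase_audio_mass(attn: Sequence[float], is_audio: Sequence[bool], alpha: float) -> List[float]:
    """Multiply audio-token weights by alpha, then renormalize the row to sum to 1."""
    scaled = [w * alpha if a else w for w, a in zip(attn, is_audio)]
    total = sum(scaled)
    return [w / total for w in scaled]

def redistribute_within_audio(
    attn: Sequence[float], is_audio: Sequence[bool], relevance: Sequence[float]
) -> List[float]:
    """Keep the total audio mass fixed but split it across audio tokens by relevance."""
    audio_mass = sum(w for w, a in zip(attn, is_audio) if a)
    rel_total = sum(r for r, a in zip(relevance, is_audio) if a)
    return [
        audio_mass * r / rel_total if a else w
        for w, r, a in zip(attn, relevance, is_audio)
    ]
```

The first function changes how much attention audio receives relative to text; the second leaves that balance untouched and only changes which audio tokens receive it, which is the distinction behind the claim that modality imbalance alone cannot explain the failures.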

2606.17416 2026-06-17 cs.SD cs.AI New submission

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim, Seong-Whan Lee

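Affiliations: Department of Artificial Intelligence, Korea University

AI summary: In multilingual speaker verification, language-dependent acoustic variability entangles speaker identity with linguistic characteristics. L-Proto, a language-aware episodic prototypical training strategy, builds language-consistent training episodes to reduce language-driven variation and improve cross-lingual generalization.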

Comments Accepted by INTERSPEECH 2026

Abstract

Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues with speaker identity, causing speakers to form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
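Sampling all speakers of an episode from a single language can be sketched as follows. This is a minimal illustration under stated assumptions: the `(speaker_id, language)` tuple format, the function name, and the uniform choice among eligible languages are hypothetical, and the prototypical loss computed on each episode is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def sample_language_consistent_episode(
    speakers: Sequence[Tuple[str, str]],  # (speaker_id, language) pairs
    n_speakers: int,
    rng: random.Random,
) -> Tuple[str, List[str]]:
    """Pick one language, then draw n_speakers distinct speakers of that language."""
    by_lang: Dict[str, List[str]] = defaultdict(list)
    for speaker_id, language in speakers:
        by_lang[language].append(speaker_id)
    # Only languages with enough speakers can fill an episode.
    eligible = sorted(lang for lang, ids in by_lang.items() if len(ids) >= n_speakers)
    if not eligible:
        raise ValueError("no language has enough speakers for an episode")
    language = rng.choice(eligible)
    return language, rng.sample(by_lang[language], n_speakers)
```

Random episodic sampling, the baseline named in the abstract, would instead draw the speakers from the whole pool regardless of language.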

2606.17413 2026-06-17 cs.LG stat.AP New submission

Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows

Alejandro Calle-Saldarriaga, Felix Jimenez, Jack Grosskreuz, Jiazheng Wang, Jonathan Hobbs, Matthias Katzfuss

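Affiliations: University of Wisconsin–Madison; Jet Propulsion Laboratory, California Institute of Technology

AI summary: Proposes a deep learning framework that uses Laplace approximations and normalizing flows to retrieve atmospheric CO2 from OCO-2 spectra quickly and accurately with quantified uncertainty; inference is orders of magnitude faster than the operational full-physics solver, and point estimates are more accurate than those of baseline methods.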

Comments 23 pages, 8 figures

Abstract

Space-based monitoring of atmospheric carbon dioxide (CO2) is essential for constraining the global carbon budget. NASA's Orbiting Carbon Observatory-2 (OCO-2) estimates column-averaged dry-air mole fractions of CO2 (XCO2) using high-resolution spectra. However, current operational retrieval algorithms are computationally expensive and do not properly quantify uncertainties. We present a novel deep learning framework that addresses these challenges. Because ground-truth data are difficult to obtain for real satellite observations, we develop and validate our approach using a high-fidelity simulation dataset. This dataset, created to support OCO-2 uncertainty quantification (UQ), incorporates realistic forward model errors. Our architecture encodes spectral bands using a multi-branch neural network and estimates posteriors of the full CO2 column or desired summaries thereof using two scalable UQ methods: Laplace approximations and normalizing flows. Our approach has five key advantages relative to operational "full-physics" solvers: (1) Amortization: Inference is orders of magnitude faster, enabling real-time processing of massive data streams; (2) Model error robustness: By training on simulations that explicitly include model discrepancies, our method accounts for systematic errors often neglected by standard inversions; (3) Point estimate accuracy: We achieve superior predictive accuracy compared to baseline methods; (4) Improved UQ: The probabilistic outputs yield better-calibrated uncertainty estimates; and (5) Non-Gaussian posteriors: When utilizing normalizing flows, our framework successfully models complex, asymmetric posterior distributions, overcoming the limitations of the Gaussian assumption. These results suggest that simulation-based deep learning is a viable path toward next-generation operational processing systems.
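For orientation, the two UQ methods named above have the following standard textbook forms. The symbols are chosen here for illustration and are not the paper's notation: \theta is the quantity whose posterior is approximated (the abstract does not say whether this is the network weights or the CO2 state), \mathcal{D} is the data, x is the retrieved quantity, y is the observed spectrum, and f_y is an invertible map conditioned on y.

```latex
% Laplace approximation: a Gaussian centred at the posterior mode,
% with covariance given by the inverse Hessian of the negative log posterior.
p(\theta \mid \mathcal{D}) \approx \mathcal{N}\!\left(\theta;\ \hat{\theta},\ H^{-1}\right),
\qquad
\hat{\theta} = \arg\max_{\theta} \log p(\theta \mid \mathcal{D}),
\qquad
H = -\nabla_{\theta}^{2} \log p(\theta \mid \mathcal{D}) \Big|_{\theta = \hat{\theta}}

% Conditional normalizing flow: change of variables from a simple base density p_Z.
q(x \mid y) = p_Z\!\left(f_{y}^{-1}(x)\right)
\left| \det \frac{\partial f_{y}^{-1}(x)}{\partial x} \right|
```

The first form is Gaussian by construction, while the second can represent skewed or multimodal densities, which is consistent with advantage (5) in the abstract.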

2606.17412 2026-06-17 cs.CV cs.AI New submission

Enhancing Pathological VLMs with Cross-scale Reasoning

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Affiliations: Department of Electrical and Computer Engineering, National University of Singapore; PuzzleLogic Pte Ltd; Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital

AI summary: Introduces the first cross-scale training and evaluation paradigm, which strengthens cross-scale reasoning in pathology vision-language models through multi-magnification visual question answering, together with the Scale-VQA benchmark and the ScaleReasoner-R1 model, which achieves state-of-the-art performance.

Abstract

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks. Findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.
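Text-only screening can be pictured as a filter that discards any question a model answers correctly without seeing the images. The sketch below is a simplified illustration: the dictionary keys, the `text_only_answer` callable, and the single-pass rule are assumptions; the paper's actual pipeline may use several models or repeated trials.

```python
from typing import Callable, Dict, List

def screen_text_only_leaks(
    questions: List[Dict],
    text_only_answer: Callable[[str, List[str]], int],
) -> List[Dict]:
    """Keep only the questions that a text-only model (no images) gets wrong."""
    kept = []
    for q in questions:
        guess = text_only_answer(q["question"], q["options"])
        # A correct guess without images signals a possible textual shortcut.
        if guess != q["answer_index"]:
            kept.append(q)
    return kept
```

Questions removed by such a filter are those where wording or magnification cues alone may reveal the answer, which is the leakage the abstract warns about.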

2606.17410 2026-06-17 cs.CV New submission

Attention Alignment Between Humans and Vision-Language Models

Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase, Taylor Webb, Michael Graziano

Affiliations: Princeton Neuroscience Institute, Princeton University; Department of Psychology, Princeton University; Department of Computer Science, Princeton University; Department of Psychology and Center for Computational Language Sciences, University of Southern California; Department of Psychology, Université de Montréal

AI summary: Compares spatial attention maps from six vision-language models with human fixation heatmaps and finds that decoder architecture (LSTM vs. Transformer) dominates alignment: LSTM decoders align better but are spatially diffuse and weakly task-differentiated, whereas Transformer decoders are more spatially concentrated and more strongly task-differentiated.

Abstract

Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2×2 factorial of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs. Transformer decoders increased alignment by 40-50 percentage points (80-87% vs. 40-59% of the human noise ceiling). In contrast, CNN vs. ViT encoders contributed a secondary 5-20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85-87%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.

2606.17409 2026-06-17 cs.LG cs.AI New submission

Discrete Autoregressive Transformer for Generative Mechanism Synthesis

Anar Nurizada, Anurag Purwar

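Affiliations: Computer-Aided Design and Innovation Lab, Department of Mechanical Engineering, Stony Brook University

AI summary: Proposes a discrete autoregressive transformer that casts planar path synthesis as conditional sequence modeling, generating joint coordinates conditioned on a VAE latent and a mechanism-type token to produce diverse, accurate mechanism designs.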

Abstract

Planar path synthesis requires mechanisms whose coupler curves match a prescribed trajectory; the mapping from curve to linkage is inherently one-to-many across four-, six-, and eight-bar topologies. We address this design problem with simulation-grounded evaluation on a curated corpus of over one million mechanisms, reporting Chamfer distance and dynamic time warping after forward kinematics and geometric alignment. We formulate synthesis as conditional autoregressive sequence modeling: joint coordinates are uniformly quantized to tokens and generated by a decoder-only transformer with a variational-autoencoder (VAE) latent of the target curve and an explicit mechanism-type token. Training combines token cross-entropy with a Gaussian-smoothed bin auxiliary loss that respects ordinal structure among bins. At inference, a bounded latent-noise schedule decodes all mechanism types at each noise level; we retain the top five candidates by geometric error, yielding diverse accurate families without dataset lookup. On held-out tests, aggregate mean Chamfer distance is 0.0132 and mean dynamic time warping is 0.153; a latent k-nearest-neighbor baseline that conditions on training-set neighbor latents in VAE space achieves matched-topology mean Chamfer distance 0.0071 and mean dynamic time warping 0.117 using the same decoder.
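Three ingredients of this formulation are easy to illustrate in isolation: uniform quantization of a coordinate to a token, a Gaussian-smoothed target over bins, and a Chamfer distance between two point sets. The sketch below is generic and hedged: the bin count, coordinate range, `sigma`, and the particular Chamfer convention (unsquared distances, sum of the two directional means) are assumptions, since the abstract does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def quantize(x: float, lo: float, hi: float, n_bins: int) -> int:
    """Uniformly quantize x (clamped to [lo, hi]) to a token id in [0, n_bins - 1]."""
    frac = (min(max(x, lo), hi) - lo) / (hi - lo)
    return min(int(frac * n_bins), n_bins - 1)

def smoothed_target(token: int, n_bins: int, sigma: float) -> List[float]:
    """Label distribution with Gaussian weight around the true bin, so near misses cost less."""
    weights = [math.exp(-0.5 * ((k - token) / sigma) ** 2) for k in range(n_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)
```

A cross-entropy against `smoothed_target` penalizes predicting an adjacent bin less than a distant one, which is one way to respect the ordinal structure among bins that the abstract mentions.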

2606.17408 2026-06-17 cs.RO cs.CV cs.LG New submission

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

Meipo Dai, Qiyuan Zhuang, He-Yang Xu, Ying-Jie Shuai, Yijun Wang, Qi Dou, Xiu-Shen Wei

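Affiliations: Southeast University; The Chinese University of Hong Kong

AI summary: Proposes LeaP, which uses a lightweight MLP to predict a proprioception-conditioned diagonal Gaussian as the source prior for action generation in place of the standard Gaussian; on 15 RoboTwin tasks it reaches an 81.6% average success rate, 6.5 to 25.5 percentage points above the baselines.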

Abstract

Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines -- including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy -- by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
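The idea of replacing the standard Gaussian source with a state-conditioned diagonal Gaussian can be sketched with a tiny MLP that outputs a mean and a log-variance per action dimension. Everything concrete below is an assumption made for illustration: the single hidden layer, the tanh activation, the log-variance parameterization, and all names; the paper's actual network and training loss are not described in the abstract.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

def linear(w: Matrix, b: Sequence[float], x: Sequence[float]) -> List[float]:
    """Dense layer: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def source_prior(
    proprio: Sequence[float],
    w1: Matrix, b1: Sequence[float],
    w_mu: Matrix, b_mu: Sequence[float],
    w_lv: Matrix, b_lv: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Map proprioception to the mean and log-variance of a diagonal Gaussian."""
    hidden = [math.tanh(v) for v in linear(w1, b1, proprio)]
    return linear(w_mu, b_mu, hidden), linear(w_lv, b_lv, hidden)

def sample_source(mean: Sequence[float], log_var: Sequence[float], rng: random.Random) -> List[float]:
    """Reparameterized sample: mean + std * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]
```

The sample returned by `sample_source` would take the place of the standard-Gaussian noise fed to the downstream flow-matching or diffusion generator, which itself stays unchanged.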

2606.17406 2026-06-17 cs.CV cs.AI New submission

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

Marina Chagas Bulach Gapski, Vinicius Atsushi Sato Kawai, Gustavo Rosseto Leticio, Lucas Pascotti Valem, Daniel Carlos Guimarães Pedronette, Mohand Said Allili

Affiliations: Department of Statistics, Applied Mathematics, and Computing (DEMAC), São Paulo State University (UNESP); Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP); Department of Computer Science and Engineering, University of Quebec in Outaouais (UQO)

AI summary: Proposes a GNN approach to semi-supervised image classification that combines multiple feature extractors and graph representations, using manifold learning and rank aggregation to improve classification accuracy.

Abstract

Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.
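Rank aggregation merges the neighbor rankings produced by different feature extractors into one consensus ranking. The abstract does not name the aggregation rule, so the sketch below substitutes a Borda count as one common choice; the function name and the tie-breaking by string order are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

def borda_aggregate(rankings: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Merge ranked lists: each item earns (list length - position) points per list."""
    scores: Dict[Hashable, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties are broken by the item's string form for determinism.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))
```

The aggregated ranking for each image could then define its neighbors in the graph passed to the GCN, so that an edge reflects agreement among several extractors rather than a single one.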

2606.17405 2026-06-17 cs.AI New submission

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

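Affiliations: The Cancer Genome Atlas (TCGA)

AI summary: Proposes an online adaptive framework that combines treatment effect estimation, a patient digital twin, and reinforcement learning to optimize treatment recommendations in real time under safety constraints; validation on synthetic and real clinical data shows it to be effective and stable.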

Comments Accepted for presentation at the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

AI中文摘要

临床决策支持AI系统必须适应实时变化的患者状况,同时遵守严格的安全约束。我们提出了一个在线自适应框架,整合了治疗效果估计以量化临床获益、患者数字孪生以模拟治疗轨迹,以及强化学习用于序贯决策。AI系统最初在历史医疗记录上训练,并在持续学习循环中运行。为确保安全,一个基于规则的模块监测生命体征并阻止禁忌治疗。内部模型强烈不一致的案例被标记以供临床医生审查,在我们的实验中通过预训练的结果模型模拟。我们使用合成临床模拟器和来自癌症基因组图谱的真实卵巢癌数据集验证了我们的框架。在模拟和临床环境中,我们的方法在推荐治疗方面比标准计算基线表现出更优越的有效性和稳定性。此外,AI系统保持低延迟,并且在我们实验验证中仅需对少数案例进行专家咨询,展示了其作为临床医生监督下的个性化医疗安全工具的潜力,通过实际使用持续改进。

英文摘要

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

2606.17403 2026-06-17 cs.CV cs.AI 新提交

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

桥接空间与频率视角进行灾害评估:优势与局限

Shikha V. Chandel, Yadav Raj Ghimire, Timothy Agboada, Leila Hashemi-Beni

发表机构 * College of Science and Technology(科学与技术学院) Computational Data Science and Engineering(计算数据科学与工程)

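AI summary: Compares spatial-domain, frequency-domain, and dual-domain deep learning methods for building damage classification, finding that dual-domain models outperform single-domain ones but that all models still struggle to detect minor damage.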

Comments Copyright 2026 IEEE. Published in the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

AI中文摘要

从卫星图像快速评估建筑损伤对于有效的灾害响应和恢复至关重要。虽然大多数深度学习方法依赖于空间域特征,但频率域表示可以捕捉互补的结构线索,如碎片模式和坍塌引起的纹理。本研究使用来自xView2(xBD)数据集灾后图像,对空间域、频率域和双域深度学习方法进行了受控比较,用于多类建筑损伤分类。为确保公平,所有模型均基于EfficientNet-B0骨干网络,并在相同设置下训练,仅输入表示和融合策略不同。使用准确率、宏F1分数、每类指标和混淆矩阵评估性能。结果表明,双域模型比单域方法提供了可衡量的改进。双空间配置实现了最高的测试准确率(0.4688)和最低的损失,而仅空间模型获得了最佳的宏F1分数(0.4254),表明类别性能更平衡。相比之下,仅频率模型表现最差并出现过拟合,表明泛化能力有限。尽管有这些改进,所有模型仍难以检测细微损伤级别,特别是Minor类别,这是由于类别不平衡和细粒度视觉模糊性。虽然双域方法改进了严重损伤的检测,但挑战依然存在。这些发现突出了混合表示的优势和局限,并推动了未来在数据平衡、高级融合和正则化方面的工作。

英文摘要

Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
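The abstract reports that one configuration has the highest accuracy while another has the best macro F1-score. The following is a minimal sketch of how that can happen under class imbalance. The labels and predictions are hypothetical toy values, not xBD data, and the helper names `accuracy` and `macro_f1` are introduced here for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy set: 8 "none" samples and 2 "minor" samples.
y_true = ["none"] * 8 + ["minor"] * 2

# Model A ignores the rare class entirely.
pred_majority = ["none"] * 10

# Model B finds both "minor" samples but mislabels three "none" samples.
pred_balanced = ["none"] * 5 + ["minor"] * 3 + ["minor"] * 2

for name, pred in [("majority", pred_majority), ("balanced", pred_balanced)]:
    print(name, round(accuracy(y_true, pred), 4), round(macro_f1(y_true, pred), 4))
```

In this toy case the majority-only predictor has the higher accuracy but the lower macro F1, because macro F1 weights the rare class as heavily as the common one.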

2606.17399 2026-06-17 cs.LG cs.AI 新提交

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

离散对数时钟:Transformer如何学习模乘法

Huu Danh Nguyen

发表机构 * Stanford University(斯坦福大学)

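AI summary: Using a multiplicative character transform, the analysis finds that a transformer trained on modular multiplication learns a sparse Fourier spectrum, with embeddings and MLP neurons mainly encoding a few multiplicative frequencies, indicating that the model performs addition in discrete-log space (a "Discrete-Log Clock" algorithm).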

Comments 5 pages, 5 figures. Accepted to the Mechanistic Interpretability Workshop at ICML 2026

AI中文摘要

当小型Transformer在模乘法任务中实现“grok”时,先前研究报告学习到的嵌入具有“密集”的傅里叶谱,需要所有频率。这与模加法形成对比,后者只需一组稀疏的关键频率。我们证明这种密度是错误基下分析的伪像。乘法的自然傅里叶变换不是标准加法DFT,而是乘法特征变换,它将乘法群$(\mathbb{Z}/p\mathbb{Z})^*$上的函数分解为其不可约表示。将此变换应用于在$a \cdot b \bmod 113$上训练的grokked Transformer,我们发现嵌入谱变得高度稀疏(基尼系数0.58 vs 加法基下的0.07),仅4个关键频率携带显著能量。此外,96.9%的MLP神经元被干净地调谐到单个乘法频率,并且神经元激活热图在按离散对数重排序后显示出二维周期结构。这些结果表明Transformer将乘法简化为离散对数空间中的加法,实现了类似于Nanda等人针对加法的Clock算法的“离散对数时钟”算法。该方法具有普适性:将分析基与任务的代数结构匹配,可以在标准工具视为噪声的地方揭示可解释结构。

英文摘要

When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a "Discrete-Log Clock" algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.
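The reduction of multiplication to addition in discrete-log space can be illustrated with a short sketch. This is plain number theory for the modulus 113 named in the abstract, not the paper's analysis code; the function names are hypothetical. A multiplicative character is written here as $\chi_k(a) = \exp(2\pi i \, k \log_g a / (p-1))$ for a primitive root $g$.

```python
import cmath
import math

P = 113  # prime modulus from the abstract


def find_primitive_root(p):
    """Return the smallest generator of the multiplicative group mod prime p."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found")


def discrete_log_table(p, g):
    """Map each a in 1..p-1 to the exponent k with g**k % p == a."""
    table = {}
    x = 1
    for k in range(p - 1):
        table[x] = k
        x = (x * g) % p
    return table


def multiply_via_logs(a, b, p, g, dlog):
    """Compute a*b mod p by adding discrete logs modulo p-1."""
    return pow(g, (dlog[a] + dlog[b]) % (p - 1), p)


def character(k, a, p, dlog):
    """Multiplicative character chi_k(a): a root of unity indexed by dlog(a)."""
    return cmath.exp(2j * math.pi * k * dlog[a] / (p - 1))


G = find_primitive_root(P)
DLOG = discrete_log_table(P, G)
print(multiply_via_logs(5, 7, P, G, DLOG), (5 * 7) % P)
```

Because each character satisfies $\chi_k(ab) = \chi_k(a)\chi_k(b)$, these functions play the role for multiplication that the additive DFT basis plays for modular addition.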

2606.17394 2026-06-17 cs.RO cs.LG 新提交

Damage Adaptation in Seconds for Architected Materials

结构材料的秒级损伤自适应

James Avtges, Jake Ketchum, Helena Young, Taekyoung Kim, Ryan Truby, Todd Murphey

发表机构 * Northwestern University(西北大学)

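AI summary: Proposes LEAP, which uses latent damage representations and ensemble learning to adapt to catastrophic damage in soft-actuated systems in under one minute, without simulation.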

Comments Proceedings of Robotics: Science and Systems

AI中文摘要

对损伤的自适应和原位物理修复对于长期机器人自主性至关重要,但在狭义定义和良好预期的范围之外具有挑战性。在这项工作中,我们在软驱动系统中在一分钟内本体感知地适应灾难性损伤。结构材料非常适合自适应:执行器故障是逐渐发生而非急性,并且损伤可以在低维、离散坐标空间中描述。令人惊讶的是,潜在损伤表示加上简单而稳健的集成方法足以实时适应未见过的损伤。此外,我们确定了指数样本复杂度降低为线性样本复杂度的条件,用于结构材料的学习表示,这是相对于刚性组件或连续软机构的明显优势。我们通过基于手性剪切拉胀(HSA)执行器的6自由度软手腕的追踪任务,演示了我们的自适应本体感知方法LEAP。我们的算法能够适应切割、烧伤和执行器修复,实现了无仿真的实时自适应,这对于在实验室外实现软机器人的承诺至关重要。视频和更多信息请访问此https URL。

英文摘要

Adaptation to damage and in-situ physical repairs is essential for long-term robot autonomy, yet challenging outside of narrowly defined and well-anticipated bounds. In this work we proprioceptively adapt to catastrophic damage in soft-actuated systems in under one minute. Architected materials are well equipped for adaptation: actuator failure occurs gradually rather than acutely, and damage can be described in a low-dimensional, discrete coordinate space. Surprisingly, latent damage representations plus a simple yet robust ensemble method are sufficient for adapting to unseen damage in real time. Moreover, we identify conditions under which exponential sample complexity collapses to linear sample complexity for learned representations of architected materials, a concrete advantage over rigid components or continuum soft mechanisms. We demonstrate LEAP, our method for adaptive proprioception, via a tracing task for a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators. Our algorithm is able to adapt to cuts, burns, and actuator repairs, enabling simulation-free real-time adaptation that is critical for realizing the promise of soft robots outside the lab. Videos and more information are available at https://murpheylab.github.io/leap.

2606.17391 2026-06-17 cs.CL cs.AI cs.LG 新提交

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

NarrativeWorldBench:面向长程共创音频剧的前沿饱和基准与潜在世界模型

Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) Pocket FM

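AI summary: Proposes the NarrativeWorldBench benchmark, which evaluates 21 models on nine narrative-structure metrics, and introduces N-VSSM, a variational state-space model that maintains a structured latent state over more than 200 episodes via a Mamba-2 backbone and an event-conditioned posterior, surpassing Claude Opus 4.5 on long-arc consistency and controllability.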

Comments 10 pages. Accepted to the ICML 2026 Workshops on High-dimensional Learning Dynamics (HiLD) and Culture x AI

AI中文摘要

长篇连载音频剧,其剧情弧线跨越200至800集,是一种重要的创意媒介,也是前沿大语言模型(LLM)表现不佳的场景。我们在一组统一的叙事结构指标上,对21个模型进行了基准测试,涵盖经典、微调、开放前沿、封闭前沿和推理层级。所有封闭前沿系统在情节节拍F1上饱和于[0.78, 0.81]区间,并在视界h=200时下降约-0.20 F1。我们引入了NarrativeWorldBench,一个开放基准,包含九种叙事结构指标,在h∈{10, 20, 50, 100, 200}的视界上评估,并在四种印度语言(印地语、泰米尔语、泰卢固语、马拉地语)上进行跨语言评估。我们提出了N-VSSM,一种叙事变分状态空间模型,通过Mamba-2骨干网络和事件条件后验以及8B解码器,在超过200集的时间内维持一个结构化的256维潜在世界状态。N-VSSM在所有视界上保持情节节拍F1≥0.84,计算量仅为封闭前沿区间的1/4。学习到的文化迁移函数将跨语言忠实度提高了+0.20至+0.23 Likert分。在一项受试者内作家研究(n=12位专业作者,240次试验)中,N-VSSM在长弧一致性上以71%的偏好率优于Claude Opus 4.5,在可控性上评分高出+1.3 Likert分。

英文摘要

Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

2606.17389 2026-06-17 cs.CV cs.AI cs.CL cs.LG 新提交

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

视觉会撒谎,一致性说话:在视觉-语言模型中解耦空间注意力与可靠性

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) Algoverse AI Research(Algoverse AI研究) University of California, Berkeley(加州大学伯克利分校)

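AI summary: Proposes the VLM Reliability Probe (VRP), which uses structural-attention metrics and an analysis of generation dynamics to show that spatial attention is almost uncorrelated with accuracy (R≈0.001), whereas self-consistency is the dominant predictor of reliability (R=0.429), revealing a symbolic detachment between visual features and final generation.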

Comments 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: https://github.com/itsloganmann/VLM-Reliability-Probe

AI中文摘要

多模态基础模型越来越多地被用作推理代理,因此可靠性(即知道模型何时可能产生幻觉)变得至关重要。一种常见的直觉,我们称之为注意力-置信度假设,认为可靠性源于“结构性”视觉感知:对相关区域的紧密注意力应表明答案可信,而分散的注意力则表示困惑。我们通过VLM可靠性探针(VRP)挑战这一观点,这是一项对当代视觉-语言模型(VLM)中可靠性信号进行的系统性跨家族研究。我们引入了结构注意力指标——簇计数(C_k)和空间熵(H_s)——来量化视觉编码器的注视点,并追踪其跨层的演化(ΔH_s)。这揭示了一种“符号脱离”:模型通常“早期锁定”视觉特征,但随后注意力扩散,切断了早期感知与最终生成的联系。与接地假设相反,我们发现“簇失效”:空间注意力与准确性几乎零相关(R≈0.001)。相反,可靠性是生成动态和内部状态分布的现象。自一致性,即采样推理路径之间的一致率,是真实性的主要预测因子(R=0.429)。扩展因果干预揭示了尖锐的架构差异:LLaVA将其预测锁定在脆弱的后期瓶颈中,而PaliGemma和Qwen2-VL全局分布可靠性,即使其最具预测性的层被破坏约50%或更多,仍保持韧性。对于当前的VLM,可靠性信号与视觉接地图脱离,最好通过生成时动态和隐藏状态探针来推断。

英文摘要

Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts ($C_k$) and spatial entropy ($H_s$), to quantify the visual encoder's gaze, and track its evolution ($\Delta H_s$) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation ($R \approx 0.001$) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth ($R = 0.429$). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
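The two kinds of signal contrasted in the abstract can be sketched as follows. These are assumed, simplified definitions chosen for illustration: self-consistency is taken as the share of sampled answers that match the majority answer, and spatial entropy as the Shannon entropy of a normalized attention map. The paper's exact formulas may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def spatial_entropy(attention):
    """Shannon entropy (in nats) of a 2D attention map normalized to sum to 1."""
    flat = [w for row in attention for w in row]
    total = sum(flat)
    probs = [w / total for w in flat if w > 0]
    return -sum(p * math.log(p) for p in probs)


# Hypothetical sampled answers from several reasoning paths.
samples = ["cat", "cat", "dog", "cat"]
print(self_consistency(samples))

# A focused map has low entropy; a uniform map has the maximum, log(4) here.
focused = [[1.0, 0.0], [0.0, 0.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
print(spatial_entropy(focused), spatial_entropy(uniform))
```

Under these definitions the first quantity depends only on generated outputs and the second only on where attention falls, which is the distinction the abstract draws between generation-time and visual-grounding signals.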

2606.17388 2026-06-17 cs.RO cs.CG cs.SY eess.SY 新提交

Agent Utilities over Generalized Voronoi Regions and their Gradients

广义Voronoi区域上的智能体效用及其梯度

Andre N. Costa, Petter Ögren, Carlos H. C. Ribeiro

发表机构 * Royal Institute of Technology (KTH)(皇家理工学院(KTH)) Aeronautics Institute of Technology(航空技术研究所)

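AI summary: Introduces cost-induced Voronoi regions, defines agent utility as the integral of a utility density over such a region, and derives the utility gradient using the Reynolds Transport Theorem; the method is illustrated on a two-team soccer example and reduces computation time by about an order of magnitude compared with finite differences.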

Comments Under review at IEEE Control Systems Letters (L-CSS)

AI中文摘要

在本文中,我们推广了Voronoi区域的概念,将智能体效用定义为相应Voronoi区域上效用密度的积分,推导了效用的梯度,并在足球双队示例中说明了该方法。Voronoi区域的推广形式为所谓的成本诱导Voronoi(CIV)区域,其中智能体状态空间可能与划分的空间不同。这类区域的一个例子是当成本由LQR控制问题的最优解给出时。此时,智能体状态包括位置和速度,而划分的空间仅包括位置。智能体效用通过将某个效用密度在智能体的CIV区域上积分来定义。该效用密度可能是某个有益事件(例如在足球中接球)的概率密度。那么效用就是接球的总体概率,梯度表示提高该概率的方法。我们展示了如何使用流体力学中的雷诺输运定理计算该效用梯度,并且该方法在达到类似精度的同时,计算时间比基准有限差分近似减少约一个数量级。

英文摘要

In this paper, we generalize the concept of Voronoi regions, define agent utility as the integral of a utility density over the corresponding Voronoi region, derive gradients of the utility, and illustrate the approach in a two-team example from soccer. The generalization of Voronoi regions is in the form of so-called Cost-Induced Voronoi (CIV) regions, where the agent state space may differ from the space being partitioned. One example of such regions is when the cost is given by the optimal solution of an LQR control problem. Then the agent states include position as well as velocity, while the partitioned space only includes positions. The agent utility is defined by integrating some utility density over the CIV region of the agent. This utility density might be the probability density of some beneficial event, such as receiving a pass in soccer. The utility is then the overall probability of receiving a pass and the gradient represents a way to improve that probability. We show how this utility gradient can be computed using the Reynolds Transport Theorem from fluid mechanics, and that this approach achieves similar accuracy while reducing computation time by about an order of magnitude compared to a baseline finite-difference approximation.
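A one-dimensional toy case shows the idea behind the transport-theorem gradient. This sketch uses an ordinary (distance-based) Voronoi cell rather than the paper's CIV regions, and the density and function names are assumptions made for illustration. With two agents at $p_1 < p_2$ on $[0, 1]$, agent 1 owns $[0, m]$ with $m = (p_1 + p_2)/2$, so $U_1 = \int_0^m \rho(x)\,dx$ and the boundary term gives $\partial U_1 / \partial p_1 = \rho(m) \cdot \tfrac{1}{2}$.

```python
def density(x):
    """Assumed utility density on [0, 1]; it integrates to 1."""
    return 2.0 * x


def utility(p1, p2, n=2000):
    """Integrate the density over agent 1's cell [0, m] with the midpoint rule."""
    m = 0.5 * (p1 + p2)
    h = m / n
    return sum(density((i + 0.5) * h) for i in range(n)) * h


def gradient_transport(p1, p2):
    """Boundary-flux gradient: density at the boundary times boundary velocity."""
    m = 0.5 * (p1 + p2)
    boundary_velocity = 0.5  # dm/dp1 for a midpoint boundary
    return density(m) * boundary_velocity


def gradient_finite_difference(p1, p2, eps=1e-4):
    """Baseline central difference; needs two full integrations per evaluation."""
    return (utility(p1 + eps, p2) - utility(p1 - eps, p2)) / (2.0 * eps)


print(gradient_transport(0.2, 0.6), gradient_finite_difference(0.2, 0.6))
```

The boundary-flux form evaluates the density only on the cell boundary, whereas the finite-difference baseline re-integrates over the whole region for each perturbation, which is consistent with the computational saving the abstract describes.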

2606.17386 2026-06-17 cs.CV cs.AI cs.RO 新提交

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

TerraTransfer: 无需专家示范的端到端驾驶策略学习

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校)

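AI summary: Proposes an end-to-end driving method that requires no expert demonstrations: a policy is pretrained by self-play in a vectorized simulator and then aligned with a pretrained vision backbone, reducing data cost while matching or exceeding prior methods.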

AI中文摘要

端到端自动驾驶在基准测试和实际部署中取得了最先进的性能。然而,其标准训练流程在所有阶段都成本高昂:收集和标注数百万驾驶帧代价昂贵,而在图像上进行闭环强化学习受限于每步的光真实感渲染和大视觉骨干的前向传播成本。在向量化模拟器中进行自博弈改变了经济性:每秒数百万次 rollout 步骤,状态分布自然包含碰撞、近碰撞和恢复等驾驶日志中不包含的情况。我们的方法通过解耦学习驾驶和学习视觉来利用这种不对称性。我们通过自博弈预训练单个策略,然后通过动作 KL 散度和批量关系低秩结构损失将其潜在空间与预训练视觉骨干对齐。动作目标来自自博弈策略,因此对齐从未对记录的轨迹进行监督:只需要一个(图像、场景状态)帧的配对数据集,无需模仿预训练所依赖的精心策划的专家示范。在光真实感 3D 高斯泼溅闭环场景中,得到的端到端策略匹配或超越了先前的端到端方法。

英文摘要

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.

2606.17385 2026-06-17 cs.RO 新提交

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

EgoInfinity: 一个面向任意视角机器人重定向与视频到动作机器人学习的网络规模4D手物交互数据引擎

Gaotian Wang, Kejia Ren, Andrew Morgan, Yiting Chen, Howard H. Qian, Podshara Chanrungmaneekul, Kaiyu Hang

发表机构 * Rice University(莱斯大学) Robotics and AI Institute(机器人与人工智能研究所)

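AI summary: Proposes the EgoInfinity engine, which automatically generates 4D hand-object interaction data from internet videos, enabling motion retargeting and skill learning across robot morphologies without manual annotation.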

Comments 24 pages. Project page: https://huggingface.co/spaces/Rice-RobotPI-Lab/EgoInfinity

AI中文摘要

互联网视频构成了具身人类操作知识的最大储备,然而将任意RGB视频转化为可操作的机器人训练数据仍然是一个主要瓶颈。现有的实验室或工厂收集的数据集在规模和多样性上有限,限制了开放世界机器人学习。我们不提出静态数据集,而是引入EgoInfinity,一个通用的4D手物交互数据引擎,能够为机器人重定向和学习生成网络规模的数据。EgoInfinity是一个模块化引擎,集成了感知、分割、重建、交互感知精炼和重定向,以自动化这一传统上不可扩展的视频到动作问题,无需人工循环标注。其模块化设计使引擎能够持续受益于任何集成组件的进步。通过EgoInfinity,野外人类操作视频被提升为与智能体无关的度量4D手物表示,包括手部轨迹、6自由度物体姿态和接触相关状态。EgoInfinity不是简单连接独立组件,而是结合跨模块度量校准与交互感知精炼,以提高物理可靠性,减少纯视觉重建中常见的漂移和接触不一致。我们进一步提出一种新颖的运动重定向器,将恢复的3D手部运动编译为适用于不同机器人形态的可执行关节轨迹,从而实现从任意视角和镜头尺寸(例如,人体仅部分可见)下任意机器人的视频到动作重定向。我们在感知保真度、运动学可行性、接触一致性、跨形态泛化以及真实机器人技能获取(例如,抓取、切割、擦拭和倒水)方面验证了EgoInfinity,展示了从互联网视频到可执行机器人行为的可扩展桥梁,用于开放世界机器人学习。

英文摘要

Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing drift and contact inconsistencies common in pure visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.

2606.17384 2026-06-17 cs.CV 新提交

Improving and Evaluating Hand-Object Interaction Detection

改进和评估手-物体交互检测

Ahmad Darkhalil, Dima Damen, David Fouhey

发表机构 * School of Computer Science, University of Bristol, Bristol, UK(布里斯托大学计算机科学学院) Computer Science and Electrical and Computer Engineering, New York University, NY, US(纽约大学计算机科学与电气与计算机工程系)

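AI summary: Proposes the HOI-DETR framework, which introduces hand-object and object-object interactions to the Co-DETR architecture and substantially improves detection performance on four datasets, with mAP gains of over 20 percentage points on Hands23 and FineBio.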

Comments Project page: https://ahmaddarkhalil.github.io/HOI-DETR/

AI中文摘要

理解手及其直接或通过工具交互的物体,是从动作感知到3D重建和机器人等任务的关键步骤。本文为手-物体交互(HOI)理解文献做出了多项贡献:(1)HOI-DETR,一种新框架,将手-物体和物体-物体交互引入Co-DETR架构,产生最先进的方法;(2)一个包含4个不同数据集的综合HOI评估套件,包括源自HD-EPIC数据集的视频基准和改善Hands23基准的新标注;(3)一个训练好的检查点,显著改进了Hands23、HOIST、FineBio和HD-EPIC上的最先进水平,包括在Hands23和FineBio上mAP提升超过20个百分点。我们的消融实验证实了每个模型组件的贡献。

英文摘要

Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. Our paper provides several contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions to the Co-DETR architecture to produce a state-of-the-art method; (2) a comprehensive HOI evaluation suite of 4 diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contributions of each model component.

2606.17379 2026-06-17 cs.CV cs.AI eess.IV 新提交

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

MeiBRD:元学习术中生物力学残余变形

Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

发表机构 * Rochester Institute of Technology(罗切斯特理工学院) Vanderbilt University(范德堡大学)

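AI summary: Proposes a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences, learns the residual deformation with a graph neural diffusion function, and uses meta-learning to adapt quickly from intraoperative samples; it outperforms existing methods on liver phantoms.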

AI中文摘要

由于软组织大幅变形且术中测量稀疏,精确的术中肝脏配准具有挑战性。生物力学模型通过先验知识正则化这一不适定问题,但由于简化假设而表现出持续的预测偏差,而数据驱动学习方法在数据效率、泛化能力和物理合理性方面存在困难。我们提出一个混合配准框架,利用稀疏术中对应点自适应生物力学先验。我们不是学习完整的变形场,而是学习一个校正线性生物力学预测的残余变形函数,该函数建模为图神经扩散函数,在3D肝脏网格上具有几何感知注意力。为了实现稀疏观测的长距离信息传递,我们从一个新颖的角度将稀疏术中测量视为\textit{上下文}样本,其中残余变形函数的输入-输出对完全观测,将问题转化为从术中上下文样本中学习该残余函数,使用前馈元学习器。在可变形肝脏体模数据集上的实验表明,与刚性、生物力学和数据驱动基线相比,配准精度和泛化能力得到提升,特别是在分布外几何和变形情况下。

英文摘要

Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation yet sparse intraoperative measurements. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as \textit{context} samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

2606.17377 2026-06-17 cs.LG cs.SY eess.SY 新提交

Performance-Driven Environment Abstraction with Multi-Timescale Learning

性能驱动的多时间尺度学习环境抽象

Yue Guan, Dipankar Maity, Panagiotis Tsiotras

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

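AI summary: For large Markov decision processes, proposes a performance-driven environment abstraction method that jointly optimizes the policy and a tree-structured abstraction through multi-timescale reinforcement learning, balancing performance against complexity and achieving state compression and improved sample efficiency.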

AI中文摘要

我们研究大规模马尔可夫决策过程中用于决策的性能驱动环境抽象。我们寻求直接优化决策质量的抽象,而非保留几何或拓扑结构。我们将抽象建模为通过聚合状态空间并在每个聚合状态内强制执行共享动作分布而获得的受控近似。对于固定划分,我们建立了一个性能保证,将值函数近似误差与动作共享引入的损失分离。在此分析指导下,我们开发了一个多时间尺度强化学习框架,联合调整策略和树结构环境抽象。所得算法基于Q值差异细化和粗化状态空间区域,平衡性能与抽象大小和复杂度。实验结果表明,与演员-评论家基线相比,该方法实现了显著的状态压缩、改进的样本效率和更快的重新规划。

英文摘要

We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.

2606.17376 2026-06-17 cs.RO cs.CV 新提交

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

异构移动机器人上的非接触式呼吸监测:一种多模态边缘计算框架

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

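AI summary: Proposes a multimodal contactless respiratory-rate monitoring framework for heterogeneous mobile robots that uses adaptive sensor selection, keypoint-guided ROI extraction, and signal-quality filtering to achieve robust monitoring across platforms and lighting conditions without platform-specific tuning.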

Comments 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

AI中文摘要

呼吸率监测是紧急响应、灾难恢复和传染病场景中远程分诊和受害者评估的关键组成部分,在这些场景中,最小化物理接触可以降低救援人员风险并提高操作安全性。然而,由于光照变化、姿势变化、平台异构性以及危险环境中可穿戴传感器的不实用性,非接触式呼吸率监测的现场部署仍然具有挑战性。在本文中,我们提出了一种适用于具有机载边缘计算的异构移动机器人的模态自适应非接触式呼吸率监测框架。所提出的系统结合了跨RGB、热成像、近红外和低光相机的亮度自适应传感器选择、用于姿势鲁棒监测的关键点引导胸部ROI提取,以及基于信号质量指数的滤波机制以实现可靠的呼吸估计。我们在三个机器人平台上实现并评估了该框架,涵盖四足和轮式运动以及多种边缘计算架构。在不同光照条件、受试者姿势和机器人到受试者距离下进行的实验表明,该框架无需针对每个平台进行算法重新调整即可跨平台泛化,同时揭示了模态特定的操作边界。RGB提供最广的覆盖范围,可达8米;近红外在6米内有效;热成像仅在短距离内可靠;低光传感支持在完全黑暗环境中监测,距离可达8米。总体而言,结果证明了在移动机器人上进行多模态非接触式呼吸率监测的可行性,并支持其作为危险搜救场景中自主分诊和受害者评估的基础。

英文摘要

Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
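One common way to turn a chest-motion trace into a respiratory rate is to pick the dominant frequency inside a plausible breathing band. The sketch below illustrates that generic technique on a synthetic signal; it is not the paper's pipeline, and the band limits, sampling rate, and function name are assumptions chosen for the example.

```python
import cmath
import math


def respiratory_rate_bpm(signal, fs, f_lo=0.1, f_hi=0.7):
    """Return the strongest frequency in [f_lo, f_hi] Hz as breaths per minute."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant offset
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if freq < f_lo or freq > f_hi:
            continue  # skip bins outside the assumed breathing band
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return 60.0 * best_freq


# Synthetic chest-ROI trace: 0.25 Hz breathing sampled at 10 Hz for 60 s.
fs = 10.0
chest_motion = [math.sin(2 * math.pi * 0.25 * t / fs) for t in range(600)]
print(respiratory_rate_bpm(chest_motion, fs))
```

Restricting the search to a breathing band is one simple way to reject slow drift and faster motion; a signal-quality index such as the one the abstract mentions would typically gate whether an estimate like this is reported at all.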

2606.17372 2026-06-17 cs.CL cs.AI 新提交

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

LVLMs在指称通信中的隐式与显式提示策略

Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow, Cameron R. Jones

发表机构 * Stony Brook University(石溪大学)

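AI summary: Controlling for task differences, this study compares how explicit and implicit prompting affect LVLMs' production of efficient referring expressions; models coordinate on efficient expressions under explicit prompting but fail under implicit prompting, revealing a key difference between human and AI communication.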

AI中文摘要

两项近期研究(Jones等人,2026;Zeng等人,2026)关于LVLM能否协调高效指称表达得出了明显矛盾的结论。我们在控制研究间任务差异的同时,直接比较了它们的提示风格。我们复现了当显式提示时模型可以协调高效指称表达的发现,表明其他任务差异并非导致结果分歧的原因。然而,我们也发现相同的模型无法从更隐式的提示中推断出通信效率的需求,凸显了人类与AI系统通信方式的关键差异。

英文摘要

Two recent studies (Jones et al., 2026; Zeng et al., 2026) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.

2606.17368 2026-06-17 cs.AI cs.NI 新提交

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

分布式通用智能体网络:架构、关键机制与原型

Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

发表机构 * College of Electronics and Information Engineering, Shenzhen University(深圳大学电子与信息工程学院)

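AI summary: Proposes a distributed general-purpose agent network architecture in which a protocol adaptation layer connects upper-level task semantics with lower-level network operations, addressing three core problems (semantic announcement propagation, trusted identity with multi-topic reputation, and semantic-gradient mechanism design) to enable open, trustworthy agent collaboration.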

AI中文摘要

大型语言模型加速了从被动对话助手到自主智能体的转变,这些智能体能够理解目标、规划行动、调用工具并执行多步骤任务。然而,单个智能体的能力仍受限于其本地数据、工具权限、运行时环境和治理边界。本文研究分布式通用智能体网络:开放的端到端网络,其中部署在个人设备、边缘节点或自主计算环境中的异构智能体可以相互发现、建立信任、协商合作规则并执行开放式任务。我们认为,这种网络不能通过简单地将现有的端到端覆盖网络与传统多智能体系统相结合来获得。与传统P2P网络不同,智能体网络必须传播关于意图、能力、状态和合作约束的语义声明。因此,我们提出了一种以协议适配层为中心的分层架构,该层连接上层任务语义与底层网络操作。基于该架构,本文识别出三个核心机制问题:用于协作者发现的语义公告传播、用于合作治理的可验证身份与多主题声誉、以及用于开放任务执行的语义梯度机制设计。针对每个问题,我们提出了一条技术路线,包括带顺序日志的无体八卦协议、基于BAID的身份绑定与MG-EigenTrust声誉、以及由语义归因反馈驱动的Stackelberg式机制生成循环。我们还报告了BAID式分层验证的原型开销结果以及跨主题伪装-合谋攻击下MG-EigenTrust的机制级模拟。所得框架为开放、可信和可扩展的智能体协作提供了系统级基础。

英文摘要

Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.

2606.17362 2026-06-17 cs.CV cs.AI cs.LG cs.RO New submission

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

Affiliation * NVIDIA

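AI summary Proposes DriveJudge, which combines rule-based evaluation with VLM reasoning and selectively invokes physical rule functions to produce interpretable, context-aware driving evaluation; it outperforms existing methods on driving quality classification and trajectory preference selection.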

Comments Under Review

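Details
AI abstract (translated from Chinese)

Autonomous driving has shifted toward end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge because driving quality depends heavily on context. Commonly used rule-based driving metrics such as EPDMS are interpretable but lack context awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a way that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with vision-language model (VLM) reasoning and selectively invokes physically grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples, with human annotations indicating whether the driving behavior in a given scenario is reasonable. Using this dataset, we address the underexplored problem of evaluating driving metrics and introduce two human-aligned benchmark tasks: driving quality classification and trajectory preference selection. DriveJudge exceeds EPDMS by 21.23 AUC on driving quality classification and exceeds the recent VLM-based DriveCritic by 6.5% on trajectory preference selection, setting a new standard for interpretable and precise driving evaluation.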

English abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.
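The classification result above is reported in AUC. For readers unfamiliar with the metric, the sketch below computes ROC AUC directly from its definition: the probability that a randomly chosen positive sample (here, driving judged reasonable) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels are illustrative assumptions, not the paper's evaluation code; a gain of "21.23 AUC" presumably refers to points on a 0 to 100 scale.

```python
def roc_auc(labels, scores):
    """ROC AUC from its pairwise definition (quadratic time, fine for small sets).

    labels : iterable of 0/1 ground-truth values
    scores : iterable of real-valued metric outputs, higher means "more positive"
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical human labels (1 = reasonable driving) and metric scores.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Because AUC depends only on how samples are ranked, it allows a rule-based score and a VLM-based score to be compared even when their raw scales differ.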

2606.17355 2026-06-17 cs.CV New submission

Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

Sharva Gogawale, Iddo Hakim, Gal Grudka, Mohammad Suliman, Omer Ventura, Daria Vasyutinsky-Shapira, Berat Kurar-Barakat, Nachum Dershowitz

Affiliation * School of Computer Science and AI, Tel Aviv University

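AI summary For low-resource complex layout classification, proposes a CNN-based classifier trained with layout-preserving augmentations, including narrow anisotropic Gaussian masking and reflection-induced label transformations, which substantially improves classification when annotations are scarce.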

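Details
AI abstract (translated from Chinese)

Many digitized corpora are low-resource because annotations may be scarce, page scans are noisy and low-resolution, or layouts are structurally complex in ways that degrade automatic transcription quality. The development of robust classification models for low-resource languages is held back by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To address this, we curated a complex-layout dataset, manually classified into eight layout types according to their separator regions. To overcome data scarcity, we propose a new training strategy for a CNN-based classifier that uses strong, domain-aware augmentations to improve generalization. We use narrow anisotropic Gaussian masking to suppress incidental textual detail while preserving essential separations, forcing the model to learn global geometric arrangements. In addition, we apply reflection-induced label transformations to enrich the training distribution while keeping labels consistent across asymmetric categories. The results show that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.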

English abstract

Many digitized corpora suffer from low resources because annotations may be scarce, page scans are noisy and of poor resolution, or layouts are structurally complex in ways that negatively affect the quality of automatic transcription. Developing robust classification models for low-resource languages is inhibited by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To this end, we have curated a complex-layout dataset, manually classified into eight distinct layout types based on their separator regions. To overcome data scarcity, we propose a novel training strategy in the form of a CNN-based classifier that employs strong, domain-aware augmentations to improve generalization. We utilize narrow anisotropic Gaussian masking to suppress incidental textual details while preserving essential separations, compelling the model to learn global geometric arrangements. Additionally, we implement reflection-induced label transformations to enrich the training distribution while maintaining label consistency across asymmetric categories. The results demonstrate that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.
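The two augmentations named in the abstract can be illustrated with a minimal sketch under stated assumptions. The abstract does not list the eight layout types, the blur axis, or the kernel width, so the label names in `FLIP_LABEL`, the choice to smooth along rows only, and the `sigma` value are all hypothetical. The idea shown is that mirroring a page must also swap the labels of asymmetric classes, and that a Gaussian applied along a single axis smears fine detail in that direction while leaving the other direction untouched.

```python
import math

# Hypothetical label names; the paper's actual eight layout types are not
# given in the abstract. Symmetric classes map to themselves, and
# asymmetric classes swap with their mirror image.
FLIP_LABEL = {
    "single_column": "single_column",
    "two_column": "two_column",
    "margin_left": "margin_right",
    "margin_right": "margin_left",
}


def reflect(page, label):
    """Mirror a page (list of pixel rows) left to right and remap its label."""
    flipped = [row[::-1] for row in page]
    return flipped, FLIP_LABEL[label]


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(page, sigma):
    """Anisotropic blur: smooth along each row only, with no vertical mixing."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    width = len(page[0])
    blurred = []
    for row in page:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                # Clamp indices at the borders (edge replication).
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


if __name__ == "__main__":
    page = [[0, 1, 2], [3, 4, 5]]
    print(reflect(page, "margin_left"))
    print(blur_rows([[0.0, 0.0, 1.0, 0.0, 0.0]], sigma=1.0))
```

Keeping the label map explicit makes the consistency requirement checkable: applying `reflect` twice should return the original page and the original label.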