arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17403 2026-06-17 cs.CV cs.AI New submission

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

Shikha V. Chandel, Yadav Raj Ghimire, Timothy Agboada, Leila Hashemi-Beni

Affiliations: College of Science and Technology; Computational Data Science and Engineering

AI summary: This study compares spatial-domain, frequency-domain, and dual-domain deep learning methods for building damage classification. It finds that dual-domain models outperform single-domain ones, but that every model still struggles to detect minor damage.

Comments Copyright 2026 IEEE. Published in the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Abstract

Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
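
The abstract reports that the configuration with the highest accuracy is not the one with the best macro F1-score. The sketch below shows how that can happen under class imbalance. The labels and predictions are invented toy data, not xBD results, and the helper names are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced ground truth: 8 undamaged buildings, 2 with minor damage.
y_true = ["none"] * 8 + ["minor"] * 2

# Model A ignores the minority class entirely.
pred_a = ["none"] * 10
# Model B finds one minor case but raises two false alarms.
pred_b = ["none"] * 6 + ["minor"] * 2 + ["minor", "none"]

print(accuracy(y_true, pred_a), macro_f1(y_true, pred_a))
print(accuracy(y_true, pred_b), macro_f1(y_true, pred_b))
```

Model A wins on accuracy while Model B wins on macro F1, which is why the abstract reads the macro F1-score as a sign of more balanced per-class performance.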

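2606.17399 2026-06-17 cs.LG cs.AI New submission

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

Huu Danh Nguyen

Affiliations: Stanford University

AI summary: Using a multiplicative character transform, the analysis finds that a transformer trained on modular multiplication learns a sparse Fourier spectrum. Its embeddings and MLP neurons mainly encode a few multiplicative frequencies, indicating that the model performs addition in discrete-log space, i.e. a "Discrete-Log Clock" algorithm.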

Comments 5 pages, 5 figures. Accepted to the Mechanistic Interpretability Workshop at ICML 2026

Abstract

When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a "Discrete-Log Clock" algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.
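
The change of basis described above can be sketched in a few lines. The multiplicative group mod a prime is cyclic, so reordering a function by discrete logarithm and taking an ordinary DFT of length p - 1 gives its multiplicative character transform. The test function `f`, the chosen frequency, and all helper names are assumptions made for illustration; nothing here reproduces the paper's trained model.

```python
import cmath
import math

P = 113  # modulus from the abstract


def primitive_root(p):
    """Smallest generator of the multiplicative group mod prime p (brute force)."""
    for g in range(2, p):
        x, order = g, 1
        while x != 1:
            x = x * g % p
            order += 1
        if order == p - 1:
            return g
    raise ValueError("no primitive root found; is p prime?")


def dft(values):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * j * k / n) for k, v in enumerate(values))
        for j in range(n)
    ]


g = primitive_root(P)
powers = [pow(g, k, P) for k in range(P - 1)]  # powers[k] = g^k mod P
dlog = {a: k for k, a in enumerate(powers)}     # discrete logarithm table

FREQ = 5  # hypothetical "key frequency"


def f(a):
    """Toy signal on 1..P-1 that is periodic in the discrete log of a."""
    return math.cos(2 * math.pi * FREQ * dlog[a] / (P - 1))


# Additive basis: index by the residue itself (f(0) set to 0).
additive = dft([0.0] + [f(a) for a in range(1, P)])
# Multiplicative basis: index by discrete log, i.e. visit g^0, g^1, g^2, ...
multiplicative = dft([f(a) for a in powers])


def n_significant(coeffs, tol=1e-6):
    return sum(abs(c) > tol for c in coeffs)


print("additive basis:", n_significant(additive), "non-negligible coefficients")
print("multiplicative basis:", n_significant(multiplicative), "non-negligible coefficients")
```

In the multiplicative basis the toy signal occupies only the frequencies `FREQ` and `P - 1 - FREQ`, while the same signal is spread over most additive frequencies. This is the qualitative effect the abstract attributes to analyzing in the wrong basis.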

2606.17394 2026-06-17 cs.RO cs.LG New submission

Damage Adaptation in Seconds for Architected Materials

James Avtges, Jake Ketchum, Helena Young, Taekyoung Kim, Ryan Truby, Todd Murphey

Affiliations: Northwestern University

AI summary: Proposes LEAP, which uses latent damage representations and ensemble learning to adapt to catastrophic damage in soft-actuated systems in under one minute, without simulation.

Comments Proceedings of Robotics: Science and Systems

Abstract

Adaptation to damages and in-situ physical repairs is essential for long-term robot autonomy, yet challenging outside of narrowly defined and well-anticipated bounds. In this work we proprioceptively adapt to catastrophic damage in soft-actuated systems in under one minute. Architected materials are well equipped for adaptation: actuator failure occurs gradually rather than acutely, and damage can be described in a low-dimensional, discrete coordinate space. Surprisingly, latent damage representations plus a simple yet robust ensemble method are sufficient for adapting to unseen damage in real time. Moreover, we identify conditions under which exponential sample complexity collapses to linear sample complexity for learned representations of architected materials, a concrete advantage over rigid components or continuum soft mechanisms. We demonstrate LEAP, our method for adaptive proprioception, via a tracing task for a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators. Our algorithm is able to adapt to cuts, burns, and actuator repairs, enabling simulation-free real-time adaptation that is critical for realizing the promise of soft robots outside the lab. Videos and more information are available at https://murpheylab.github.io/leap.

2606.17391 2026-06-17 cs.CL cs.AI cs.LG New submission

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

Logan Mann, Abdur Rahman, Mohammad Saifullah, Taaha Kazi, Vasu Sharma

Affiliations: University of California, Santa Barbara; Pocket FM

AI summary: Proposes the NarrativeWorldBench benchmark, which evaluates 21 models on nine narrative-structure metrics, and introduces N-VSSM, a variational state-space model that uses a Mamba-2 backbone and an event-conditioned posterior to maintain a structured latent state over more than 200 episodes. N-VSSM outperforms Claude Opus 4.5 on long-arc consistency and controllability.

Comments 10 pages. Accepted to the ICML 2026 Workshops on High-dimensional Learning Dynamics (HiLD) and Culture x AI

Abstract

Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

2606.17389 2026-06-17 cs.CV cs.AI cs.CL cs.LG New submission

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu

Affiliations: University of California, Santa Barbara; Algoverse AI Research; University of California, Berkeley

AI summary: Proposes the VLM Reliability Probe (VRP). Using structural-attention metrics and an analysis of generation dynamics, it finds that spatial attention is almost uncorrelated with accuracy (R≈0.001), whereas self-consistency is the main predictor of reliability (R=0.429), revealing a "symbolic detachment" between visual features and the final generation.

Comments 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: https://github.com/itsloganmann/VLM-Reliability-Probe

Abstract

Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts ($C_k$) and spatial entropy ($H_s$), to quantify the visual encoder's gaze, and track its evolution ($\Delta H_s$) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation ($R \approx 0.001$) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth ($R = 0.429$). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
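
Two of the signals named in the abstract are easy to state concretely. The sketch below computes a self-consistency score as the share of sampled answers that match the majority answer, and a spatial entropy as the Shannon entropy of a normalized attention map. Both definitions are assumptions chosen for illustration; the paper's exact formulas may differ, and the sample inputs are invented.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Share of sampled answers that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def spatial_entropy(attention):
    """Shannon entropy (in nats) of a 2D attention map normalized to sum to 1."""
    total = sum(sum(row) for row in attention)
    probs = [v / total for row in attention for v in row if v > 0]
    return -sum(p * math.log(p) for p in probs)


# Four sampled reasoning paths for one question (toy data).
samples = ["cat", "cat", "dog", "cat"]
print(self_consistency(samples))

focused = [[1.0, 0.0], [0.0, 0.0]]   # all attention on one patch
diffuse = [[1.0, 1.0], [1.0, 1.0]]   # attention spread evenly
print(spatial_entropy(focused), spatial_entropy(diffuse))
```

Low entropy corresponds to the "tight attention" that the Attention-Confidence Assumption treats as a sign of a trustworthy answer. The abstract's finding is that this signal barely tracks accuracy, whereas the agreement rate does.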

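2606.17388 2026-06-17 cs.RO cs.CG cs.SY eess.SY New submission

Agent Utilities over Generalized Voronoi Regions and their Gradients

Andre N. Costa, Petter Ögren, Carlos H. C. Ribeiro

Affiliations: Royal Institute of Technology (KTH); Aeronautics Institute of Technology

AI summary: Introduces Cost-Induced Voronoi regions, defines agent utility as the integral of a utility density over such a region, and derives the utility gradient with the Reynolds Transport Theorem. The method is validated on a two-team soccer example, with computation time about an order of magnitude lower than a finite-difference approach.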

Comments Under review at IEEE Control Systems Letters (L-CSS)

Abstract

In this paper, we generalize the concept of Voronoi regions, define agent utility as the integral of a utility density over the corresponding Voronoi region, derive gradients of the utility, and illustrate the approach in a two-team example from soccer. The generalization of Voronoi regions is in the form of so-called Cost-Induced Voronoi (CIV) regions, where the agent state space may differ from the space being partitioned. One example of such regions is when the cost is given by the optimal solution of an LQR control problem. Then the agent states include position as well as velocity, while the partitioned space only includes positions. The agent utility is defined by integrating some utility density over the CIV region of the agent. This utility density might be the probability density of some beneficial event, such as receiving a pass in soccer. The utility is then the overall probability of receiving a pass and the gradient represents a way to improve that probability. We show how this utility gradient can be computed using the Reynolds Transport Theorem from fluid mechanics, and that this approach achieves similar accuracy while reducing computation time by about an order of magnitude compared to a baseline finite-difference approximation.
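
The role of the Reynolds Transport Theorem can be sketched as follows. The notation is assumed here rather than taken from the paper: $V_i(x)$ is agent $i$'s region for joint agent state $x$, $\phi$ is a utility density that does not itself depend on $x$, $x_j$ is one scalar component of the state, and $n(q)$ is the outward unit normal on the boundary $\partial V_i$.

```latex
U_i(x) = \int_{V_i(x)} \phi(q)\, \mathrm{d}q ,
\qquad
\frac{\partial U_i}{\partial x_j}
  = \int_{\partial V_i(x)} \phi(q)\,
    \left( \frac{\partial q}{\partial x_j} \cdot n(q) \right) \mathrm{d}S .
```

Because $\phi$ is assumed independent of $x$, the interior term of the theorem vanishes and only the boundary integral remains. The gradient therefore depends on how the region's boundary moves with the agent state, not on re-integrating the whole region as a finite-difference scheme would.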

2606.17386 2026-06-17 cs.CV cs.AI cs.RO New submission

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

Affiliations: Applied Intuition; UCLA; UC Berkeley

AI summary: Proposes an end-to-end driving method that needs no expert demonstrations. A policy is pretrained by self-play in a vectorized simulator and then aligned with a pretrained vision backbone, lowering data cost while matching or exceeding existing methods.

Abstract

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
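
The action KL term can be illustrated for a discrete action space. This is a minimal sketch, and it assumes categorical action distributions and the direction KL(self-play policy || vision policy); the abstract specifies neither, and the logits below are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same actions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Target action distribution from the self-play policy (toy logits).
teacher = softmax([2.0, 0.5, -1.0])
# Vision-based policy outputs for the paired image: one aligned, one not.
aligned = softmax([2.0, 0.5, -1.0])
misaligned = softmax([-1.0, 0.5, 2.0])

print(kl_divergence(teacher, aligned))
print(kl_divergence(teacher, misaligned))
```

The loss is zero when the vision-based policy reproduces the self-play policy's action distribution and grows as the two diverge. No logged expert trajectory enters the computation, which matches the abstract's point that paired (image, scene-state) frames suffice.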

2606.17385 2026-06-17 cs.RO New submission

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

Gaotian Wang, Kejia Ren, Andrew Morgan, Yiting Chen, Howard H. Qian, Podshara Chanrungmaneekul, Kaiyu Hang

Affiliations: Rice University; Robotics and AI Institute

AI summary: Proposes the EgoInfinity engine, which automatically generates 4D hand-object interaction data from internet videos and enables motion retargeting and skill learning across robot morphologies without manual annotation.

Comments 24 pages. Project page: https://huggingface.co/spaces/Rice-RobotPI-Lab/EgoInfinity

Abstract

Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing drift and contact inconsistencies common in pure visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.

2606.17384 2026-06-17 cs.CV New submission

Improving and Evaluating Hand-Object Interaction Detection

Ahmad Darkhalil, Dima Damen, David Fouhey

Affiliations: School of Computer Science, University of Bristol, Bristol, UK; Computer Science and Electrical and Computer Engineering, New York University, NY, US

AI summary: Proposes the HOI-DETR framework, which introduces hand-object and object-object interactions into the Co-DETR architecture and substantially improves detection performance on four datasets, with mAP gains of more than 20 percentage points on Hands23 and FineBio.

Comments Project page: https://ahmaddarkhalil.github.io/HOI-DETR/

Abstract

Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. Our paper provides several contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions to the Co-DETR architecture to produce a state-of-the-art method; (2) a comprehensive HOI evaluation suite of 4 diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contributions of each model component.

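2606.17379 2026-06-17 cs.CV cs.AI eess.IV New submission

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

Affiliations: Rochester Institute of Technology; Vanderbilt University

AI summary: Proposes a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. A graph neural diffusion function learns the residual deformation, and meta-learning adapts it quickly from intraoperative samples. The method outperforms existing approaches on a liver phantom.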

Abstract

Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation yet sparse intraoperative measurements. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as \textit{context} samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

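2606.17377 2026-06-17 cs.LG cs.SY eess.SY New submission

Performance-Driven Environment Abstraction with Multi-Timescale Learning

Yue Guan, Dipankar Maity, Panagiotis Tsiotras

Affiliations: Georgia Institute of Technology; University of North Carolina at Charlotte

AI summary: For large Markov decision processes, proposes a performance-driven environment abstraction method that uses multi-timescale reinforcement learning to jointly optimize the policy and a tree-structured abstraction, balancing performance against complexity to achieve state compression and better sample efficiency.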

Abstract

We study performance-driven environment abstraction for decision-making in large Markov decision processes. Rather than preserving geometric or topological structure, we seek abstractions that directly optimize decision quality. We model abstraction as a controlled approximation obtained by aggregating the state space and enforcing a shared action distribution within each aggregated state. For a fixed partition, we establish a performance guarantee that separates value-function approximation error from the loss introduced by action sharing. Guided by this analysis, we develop a multi-timescale reinforcement learning framework that jointly adapts the policy and a tree-structured environment abstraction. The resulting algorithm refines and coarsens regions of the state space based on Q-value discrepancies, balancing performance against abstraction size and complexity. Empirical results demonstrate substantial state compression, improved sample efficiency, and faster replanning compared to actor-critic baselines.
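
A refine/coarsen rule driven by Q-value discrepancies might look like the sketch below. The discrepancy measure (largest per-action range of Q-values across states), the two thresholds, and the function names are all hypothetical; the abstract does not give the paper's actual criterion.

```python
def q_spread(q_rows):
    """Largest per-action range of Q-values across the given states.

    q_rows: one list of Q-values per state, all with the same action order.
    """
    return max(max(col) - min(col) for col in zip(*q_rows))


def should_refine(cell_q_rows, split_tol):
    """Split a cell whose states disagree too much to share one action distribution."""
    return q_spread(cell_q_rows) > split_tol


def should_coarsen(sibling_cells_q_rows, merge_tol):
    """Merge sibling cells whose states would agree closely if pooled."""
    merged = [row for cell in sibling_cells_q_rows for row in cell]
    return q_spread(merged) < merge_tol


# One cell with two states that prefer different actions (toy values).
cell = [[1.0, 0.2], [0.1, 0.9]]
# Two sibling cells, one state each, with nearly identical Q-values.
siblings = [[[0.50, 0.10]], [[0.52, 0.12]]]

print(should_refine(cell, split_tol=0.5))
print(should_coarsen(siblings, merge_tol=0.05))
```

Keeping the merge threshold below the split threshold leaves a dead band between the two decisions, so a region is not split and merged again on successive updates.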

2606.17376 2026-06-17 cs.RO cs.CV New submission

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

Affiliations: University of Maryland, Baltimore County

AI summary: Proposes a multimodal contactless respiratory-rate monitoring framework for heterogeneous mobile robots. Adaptive sensor selection, keypoint-guided ROI extraction, and signal-quality filtering give robust monitoring across several platforms and lighting conditions without platform-specific tuning.

Comments 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

Abstract

Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
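
The final estimation step, turning a chest-ROI motion signal into a respiratory rate, can be sketched with a spectral peak search. This is a generic technique, not necessarily the paper's estimator. The breathing band of 0.1 to 0.7 Hz, the function name, and the synthetic signal are assumptions.

```python
import cmath
import math


def respiratory_rate_bpm(signal, fs, band=(0.1, 0.7)):
    """Estimate breaths per minute from the dominant in-band DFT peak.

    signal: chest-motion samples (any unit); fs: sampling rate in Hz.
    band: assumed plausible breathing band in Hz (6 to 42 breaths per minute).
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC offset
    best_freq, best_mag = None, -1.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if not band[0] <= freq <= band[1]:
            continue  # skip bins outside the breathing band
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_freq, best_mag = freq, abs(coeff)
    if best_freq is None:
        raise ValueError("signal too short for the requested band")
    return best_freq * 60.0


# Synthetic 40 s recording at 10 Hz: 0.25 Hz breathing plus a weaker 1.2 Hz component.
fs = 10.0
t = [i / fs for i in range(400)]
synthetic = [
    math.sin(2 * math.pi * 0.25 * ti) + 0.3 * math.sin(2 * math.pi * 1.2 * ti)
    for ti in t
]
rr = respiratory_rate_bpm(synthetic, fs)
print(rr)  # 15.0 breaths per minute
```

Restricting the search to a breathing band keeps faster periodic components, such as the 1.2 Hz term here, from being mistaken for respiration. A signal-quality index like the one the abstract mentions could then reject windows whose in-band peak is weak.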

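2606.17368 2026-06-17 cs.AI cs.NI New submission

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

Affiliations: College of Electronics and Information Engineering, Shenzhen University

AI summary: Proposes a distributed general-purpose agent network architecture in which a protocol adaptation layer connects upper-level task semantics with lower-level network operations. It addresses three core problems (semantic announcement propagation, trusted identity with multi-topic reputation, and semantic-gradient mechanism design) to enable open, trustworthy agent collaboration.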

Abstract

Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.
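
For readers unfamiliar with the reputation scheme that MG-EigenTrust builds on, the sketch below implements the basic single-topic EigenTrust iteration: row-normalize local trust scores, then repeatedly propagate trust through the network with a small pull toward a pre-trusted distribution. It does not implement the paper's multi-topic MG-EigenTrust variant, and the three-peer matrix is invented.

```python
def eigentrust(local_trust, pretrust=None, alpha=0.1, tol=1e-12, max_iter=1000):
    """Basic EigenTrust: global trust as the fixed point of damped trust propagation.

    local_trust[i][j]: how much peer i trusts peer j from direct experience.
    pretrust: distribution over pre-trusted peers (uniform if omitted).
    alpha: weight of the pretrust distribution in each update.
    """
    n = len(local_trust)
    if pretrust is None:
        pretrust = [1.0 / n] * n
    # Row-normalize non-negative scores; peers with no opinions defer to pretrust.
    c = []
    for row in local_trust:
        clipped = [max(v, 0.0) for v in row]
        total = sum(clipped)
        c.append([v / total for v in clipped] if total > 0 else list(pretrust))
    t = list(pretrust)
    for _ in range(max_iter):
        new_t = [
            (1 - alpha) * sum(c[i][j] * t[i] for i in range(n)) + alpha * pretrust[j]
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_t, t)) < tol:
            return new_t
        t = new_t
    return t


# Toy network: peer 2 receives little trust from the others.
local = [
    [0, 4, 1],
    [5, 0, 0],
    [3, 3, 0],
]
scores = eigentrust(local)
print(scores)
```

A peer's global score depends on who trusts it, weighted by how trusted those peers are themselves. The pretrust term limits how far a closed group of colluding peers can inflate each other's scores.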

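2606.17362 2026-06-17 cs.CV cs.AI cs.LG cs.RO New submission

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

Affiliations: NVIDIA

AI summary: Proposes DriveJudge, which combines rule-based evaluation with VLM reasoning and selectively invokes physics-based rule functions to give interpretable, context-aware driving evaluation. It outperforms existing methods on driving quality classification and trajectory preference selection.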

Comments Under Review

Abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17355 2026-06-17 cs.CV 新提交

Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

野外复杂版面分类:一种低资源方法及版面保持增强

Sharva Gogawale, Iddo Hakim, Gal Grudka, Mohammad Suliman, Omer Ventura, Daria Vasyutinsky-Shapira, Berat Kurar-Barakat, Nachum Dershowitz

发表机构 * School of Computer Science and AI, Tel Aviv University(特拉维夫大学计算机科学与人工智能学院)

AI总结 针对低资源复杂版面分类问题,提出基于CNN的分类器,采用窄各向异性高斯掩码和反射诱导标签变换等版面保持增强方法,在标注稀缺下显著提升分类性能。

详情
AI中文摘要

许多数字化语料库面临低资源问题,因为标注可能稀缺、页面扫描噪声大且分辨率低,或者版面结构复杂,对自动转录质量产生负面影响。低资源语言的鲁棒分类模型开发受到缺乏大规模标注数据和页面版面频繁语义复杂性的制约。为此,我们整理了一个复杂版面数据集,根据分隔区域手动分为八种版面类型。为克服数据稀缺,我们提出了一种基于CNN的分类器的新型训练策略,采用强领域感知增强来改善泛化。我们利用窄各向异性高斯掩码抑制偶然文本细节,同时保留基本分隔,迫使模型学习全局几何排列。此外,我们实施反射诱导标签变换以丰富训练分布,同时保持不对称类别间的标签一致性。结果表明,版面特定增强可以在严重标注稀缺下显著改善页面级版面分类。

英文摘要

Many digitized corpora suffer from low resources because annotations may be scarce, page scans are noisy and of poor resolution, or layouts are structurally complex in ways that negatively affect the quality of automatic transcription. Developing robust classification models for low-resource languages is inhibited by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To this end, we have curated a complex-layout dataset, manually classified into eight distinct layout types based on their separator regions. To overcome data scarcity, we propose a novel training strategy in the form of a CNN-based classifier that employs strong, domain-aware augmentations to improve generalization. We utilize narrow anisotropic Gaussian masking to suppress incidental textual details while preserving essential separations, compelling the model to learn global geometric arrangements. Additionally, we implement reflection-induced label transformations to enrich the training distribution while maintaining label consistency across asymmetric categories. The results demonstrate that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.
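The reflection-induced label transformation mentioned in this abstract can be illustrated with a small sketch. It assumes hypothetical class names (the abstract does not list the eight layout types): mirroring a page left-to-right turns an asymmetric layout into its mirror class, while symmetric layouts keep their label.

```python
# Hypothetical layout labels; the paper's eight classes are not named in the abstract.
MIRROR_LABEL = {
    "sidebar_left": "sidebar_right",
    "sidebar_right": "sidebar_left",
    "single_column": "single_column",
    "two_column": "two_column",
}

def hflip(page):
    """Mirror a page (a list of pixel rows) left-to-right."""
    return [row[::-1] for row in page]

def reflect_sample(page, label):
    """Return the mirrored page together with the label it now deserves."""
    return hflip(page), MIRROR_LABEL[label]

# Toy page with a dark separator along the left edge.
page = [[1, 0, 0],
        [1, 0, 0]]
flipped_page, flipped_label = reflect_sample(page, "sidebar_left")
```

Applying the transform twice returns the original sample, which is a convenient sanity check for any such label map.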

2606.17354 2026-06-17 cs.CL cs.AI 新提交

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

翻译不可译:一个可操作化的不可译性本体论

Jacob Bremerman, Brihi Joshi, Hirona Arai, Xiang Ren, Jonathan May

发表机构 * University of Southern California Information Sciences Institute(南加州大学信息科学研究所)

AI总结 提出一个结构化的不可译性本体论和补偿策略分类法,构建多语言数据集,通过人类偏好研究发现注释补偿策略最受青睐,为策略感知机器翻译奠定基础。

详情
AI中文摘要

不可译性,即意义无法在语言间直接保留的情况,在语言学中已有深入研究,但在自然语言处理中尚未充分探索。随着机器翻译系统在标准基准测试上的改进,其局限性越来越集中在这些情况下,即翻译无法简化为一一对应。我们引入了一个结构化的不可译性本体论以及补偿策略的分类法,这些策略是在这些不可译情况下传达意义的具体技术。我们将该框架操作化为一个多语言数据集,包含不可译句子及其基于策略的翻译,从而能够对翻译行为进行受控分析。初步的人类偏好研究表明,翻译质量取决于所使用的策略,并且对包含解释性上下文(称为注释补偿策略)的输出存在一致的偏好。我们的框架和数据集为研究和建模策略感知的机器翻译提供了基础。

英文摘要

Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations increasingly concentrate in such cases, where translation cannot be reduced to one-to-one equivalence. We introduce a structured ontology of untranslatability along with a taxonomy of compensation strategies, which are specific techniques to convey meaning under these untranslatable circumstances. We operationalize this framework into a multilingual dataset of untranslatable sentences paired with strategy-based translations, enabling controlled analysis of translation behavior. Initial human preference studies suggest that translation quality depends on the strategy used, with consistent preferences for outputs that include explanatory context, known as the Annotation compensation strategy. Our framework and dataset provide a foundation for studying and modeling strategy-informed machine translation.

2606.17352 2026-06-17 cs.LG cs.CV 新提交

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

MM++: 基于Top-K门控特征融合的无监督尺度不变多层OOD检测

Rahim Hossain, Md Tawheedul Islam Bhuian, Md Farhan Shadiq, Kyoung-Don Kang

发表机构 * School of Computing, State University of New York at Binghamton(纽约州立大学宾汉姆顿分校计算机学院)

AI总结 提出MM++框架,通过熵密度下降识别判别性中间层,结合Ledoit-Wolf正则化协方差矩阵实现无监督、后处理、尺度不变的多层OOD检测,在近/远OOD场景中表现鲁棒。

详情
AI中文摘要

我们提出了MM++(多层马氏距离++),一个完全无监督、严格事后处理且尺度不变的分布外(OOD)检测框架。为了解决尺度不变性与层次表达性之间的权衡,MM++构建了一个原则性的联合特征空间。它首先通过测量熵密度下降来识别判别性中间层,这些下降标志着尖锐语义压缩的边界。通过将这些选定层与终端表示融合,该框架捕获潜在的跨层相关性,同时减轻早期层噪声。关键地,一个Ledoit-Wolf正则化的绑定协方差矩阵稳定了这个统一空间,使得距离估计可靠。无需辅助OOD数据、分类器微调或架构修改,MM++在近和远OOD检测的不同架构上均提供了鲁棒性能。

英文摘要

We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.

2606.17350 2026-06-17 cs.CL cs.AI 新提交

Do Large Language Models Always Tell The Same Stories?

大型语言模型总是讲述相同的故事吗?

Thennal DK, Hans Ole Hatzel

发表机构 * University of Hamburg(汉堡大学)

AI总结 通过对比框架和人类故事数据集,研究10种LLM生成故事的叙事相似性,发现LLM故事比人类故事更相似,前沿模型趋向于“平均”通用叙事,且常见缓解策略无效。

详情
AI中文摘要

大型语言模型(LLMs)的最新进展使得生成高质量散文成为可能,但这些模型是否能够生成多样化的输出仍然存在争议。在这项工作中,我们通过叙事相似性框架研究了LLM生成故事的多样性。使用对比框架和来自r/WritingPrompts的人类编写故事和提示数据集,我们收集了10个代表性LLM的叙事相似性判断,同时利用人类评估和三种不同的自动注释方法。我们的发现揭示了一个一致的趋势:LLM生成的叙事彼此之间始终比人类编写的故事更相似。我们证明,特别是前沿模型收敛于一种“平均”通用叙事,这种叙事近似于个体人类故事,但缺乏人类作者的整体多样性。最后,我们表明常见的缓解策略,包括负提示和温度缩放,未能有效解决这种同质性。

英文摘要

Recent advances in large language models (LLMs) have enabled the generation of high-quality prose, yet the question of whether these models are capable of generating diverse outputs remains contested. In this work, we investigate the diversity of LLM-generated stories through the framework of narrative similarity. Using a contrastive framework and a dataset of human-written stories and prompts from r/WritingPrompts, we collect narrative similarity judgments across 10 representative LLMs, utilizing both human evaluations and three different automatic annotation methods. Our findings reveal a consistent trend: LLM-generated narratives are consistently more similar to each other than human-written stories are. We demonstrate that frontier models in particular converge on a "mean" generic narrative that approximates individual human stories but lacks the collective diversity of human authors. Finally, we show that common mitigation strategies, including negative prompting and temperature scaling, fail to meaningfully address this homogeneity.

2606.17345 2026-06-17 cs.LG cs.AI 新提交

Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics

棒球投球序列的反事实优化及其对赛季级统计指标影响的估计

Ryota Takamido, Hiroki Nakamoto

发表机构 * Sports Innovation Organization, National Institute of Fitness and Sports in Kanoya(鹿屋体育大学体育创新机构)

AI总结 利用Transformer模型和反事实分析,优化MLB投球序列中的最终投球和设置投球,发现可显著提升赛季级表现(如K/9提高1.0以上),并提供了速度带有效位置等实用见解。

详情
AI中文摘要

尽管投球序列是棒球分析的核心话题,但以往研究主要关注单次打席中最终投球的优化,对前期设置投球的作用及其对长期赛季级表现的影响研究不足。为解决这些问题,本研究利用MLB Statcast数据进行了反事实分析。训练了一个基于Transformer的机器学习模型,用于预测目标投球是否会导致击球结果或挥空。然后,通过将最终投球或前期设置投球替换为替代的投球类型和位置,同时保持周围背景信息不变,生成了反事实投球序列。最优反事实选择定义为那些最小化预测击球概率的选择,并使用将模型输出与赛季统计指标关联的回归模型估计其对投手赛季统计指标的预期影响。结果表明,最终投球和设置投球的优化都可能显著影响赛季级表现,包括K/9提高超过1.0。分析还提供了若干实用见解,包括特定速度带的有效位置、投球指令的重要性以及通过中速投球扩展投球选择范围。这些发现定量支持了投球序列在棒球中的战略重要性。

英文摘要

Although pitch sequencing is a central topic in baseball analytics, previous studies have primarily focused on optimizing the final pitch within a single plate appearance, leaving the role of preceding setup pitches and their impact on long-term season-level performance insufficiently examined. To address these issues, this study conducted counterfactual analyses using MLB Statcast data. A Transformer-based machine-learning model was trained to predict whether a target pitch would result in an in-play outcome or swing-out. Counterfactual pitch sequences were then generated by replacing either the final pitch or the preceding setup pitch with alternative pitch types and locations while keeping the surrounding contextual information fixed. Optimal counterfactual selections were defined as those that minimized the predicted in-play probability, and their expected effects on pitchers' seasonal statistics were estimated using regression models linking model outputs to season statistics. The results suggest that the optimization of both final and setup pitches may substantially influence season-level performance, including improvements of more than 1.0 in K/9. The analyses also provided several practical insights, including velocity-band-specific effective locations, the importance of pitch commands, and the expansion of pitch-selection options through middle-velocity pitches. These findings quantitatively support the strategic importance of pitch sequencing in baseball.
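The counterfactual selection step described in this abstract amounts to enumerating alternative pitches with the context held fixed and keeping the one with the lowest predicted in-play probability. A minimal sketch, assuming a hand-written stand-in (`toy_in_play_prob`) for the trained Transformer and hypothetical pitch types and zones:

```python
import math
from itertools import product

# Hypothetical candidate sets; the study's actual pitch types and locations are not listed here.
PITCH_TYPES = ["fastball", "slider", "changeup"]
ZONES = ["high_in", "high_away", "low_in", "low_away"]

def toy_in_play_prob(context, pitch_type, zone):
    """Stand-in for the trained model: a hand-written logistic score."""
    score = 0.2
    if pitch_type == context["previous_pitch"]:
        score += 0.8  # assumed: repeating the setup pitch is easier to hit
    if zone.startswith("high"):
        score += 0.3
    if pitch_type == "slider" and zone == "low_away":
        score -= 0.9
    return 1.0 / (1.0 + math.exp(-score))

def best_counterfactual(context, predict=toy_in_play_prob):
    """Enumerate alternatives with the context fixed; keep the lowest in-play probability."""
    candidates = product(PITCH_TYPES, ZONES)
    return min(candidates, key=lambda c: predict(context, c[0], c[1]))

context = {"previous_pitch": "fastball", "count": (1, 2)}
choice = best_counterfactual(context)
```

Swapping in a real model only requires passing a different `predict` callable; the same loop could be run over the setup pitch instead of the final pitch.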

2606.17343 2026-06-17 cs.CV stat.AP 新提交

Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling

贝叶斯磁共振联合图像重建与不确定性量化:基于稀疏先验模型和马尔可夫链蒙特卡洛采样

Ahmed Karam Eldaly, Matteo Figini, Daniel C. Alexander

发表机构 * Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系) UCL Hawkes Institute, Department of Computer Science, University College London(伦敦大学学院计算机科学系霍克斯研究所)

AI总结 提出一种基于压缩感知磁共振图像重建的不确定性量化框架,采用贝叶斯线性逆问题建模,利用稀疏先验(总变分或小波变换)和分裂增广吉布斯采样器进行MCMC采样,在单线圈和多线圈数据集上验证了优于优化方法和深度学习方法的图像重建与不确定性量化性能。

详情
AI中文摘要

我们提出了一种新的框架,用于使用压缩感知磁共振图像重建进行不确定性量化。该问题在贝叶斯框架内被表述为线性逆问题,并为未知模型参数分配先验分布。具体而言,待重建的图像在给定基下被假设为稀疏的。我们开发了一个适用于任何基的通用框架,并作为示例,测试了图像在(1)空间梯度(使用总变分先验模型)和(2)小波变换中的稀疏性。然后,采用基于分裂增广吉布斯采样的马尔可夫链蒙特卡洛(MCMC)方法从未知参数的后验分布中采样。使用近端MCMC方法有效采样不可微的条件分布。所提出的算法在单线圈和多线圈数据集上使用各种k空间子采样模式和比率进行了验证。结果表明,与对应的基于优化的方法相比,每种提出的方法在图像重建方面具有优越性能。此外,与现有的基于深度学习的方法相比,我们的框架有效地量化了不确定性,显示估计的不确定性图与使用真实值和重建图像计算的误差图之间存在显著相关性。

英文摘要

We propose a novel framework for uncertainty quantification using compressed sensing magnetic resonance image reconstruction. The problem is formulated within a Bayesian framework as a linear inverse problem, with prior distributions assigned to the unknown model parameters. Specifically, the image to be reconstructed is assumed to be sparse in a given basis. We develop a general framework applicable to any basis and as examples, we test the sparsity of the image in its (1) spatial gradients using a total variation prior model, and in its (2) wavelet transform. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then employed to sample from the posterior distribution of the unknown parameters. The non-differentiable conditional distributions are efficiently sampled using a proximal MCMC method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sub-sampling patterns and ratios. The results demonstrate the superior performance of each proposed approach in reconstructing images compared to its counterpart optimisation-based method. Moreover, our framework effectively quantifies uncertainty, showing a notable correlation between estimated uncertainty maps and error maps computed using ground truth and reconstructed images, compared with existing deep learning-based methods.
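The Bayesian formulation summarised in this abstract can be written compactly. The following is a sketch under assumed notation (none of these symbols appear in the abstract, and real-valued notation is used for simplicity): $\mathbf{y}$ is the sub-sampled k-space data, $\boldsymbol{\Phi}$ the measurement operator, $\boldsymbol{\Psi}$ the sparsifying transform (spatial gradient or wavelet), $\sigma^2$ the noise variance, $\lambda$ the sparsity weight, and $\mathbf{z}$, $\rho$ the splitting variable and coupling parameter of a split-and-augmented Gibbs sampler.

```latex
\begin{aligned}
\mathbf{y} &= \boldsymbol{\Phi}\mathbf{x} + \mathbf{n}, \qquad \mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), \\
p(\mathbf{x} \mid \lambda) &\propto \exp\!\left(-\lambda \lVert \boldsymbol{\Psi}\mathbf{x} \rVert_1\right), \\
p(\mathbf{x} \mid \mathbf{y}, \lambda, \sigma^2) &\propto \exp\!\left(-\frac{1}{2\sigma^2}\lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{x} \rVert_2^2 - \lambda \lVert \boldsymbol{\Psi}\mathbf{x} \rVert_1\right), \\
p(\mathbf{x}, \mathbf{z} \mid \mathbf{y}, \lambda, \sigma^2) &\propto \exp\!\left(-\frac{1}{2\sigma^2}\lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{x} \rVert_2^2 - \lambda \lVert \mathbf{z} \rVert_1 - \frac{1}{2\rho^2}\lVert \boldsymbol{\Psi}\mathbf{x} - \mathbf{z} \rVert_2^2\right).
\end{aligned}
```

Under this splitting, the conditional for $\mathbf{x}$ is Gaussian, while the conditional for $\mathbf{z}$ contains the non-differentiable $\ell_1$ term, which is presumably where the proximal MCMC step is needed.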

2606.17342 2026-06-17 cs.CV 新提交

Learning a Maximum Entropy Model for Visual Textures using Diffusion

使用扩散学习视觉纹理的最大熵模型

Xinyuan Zhao, Eero P. Simoncelli

发表机构 * New York University(纽约大学) Flatiron Institute(熨斗研究所)

AI总结 提出首个基于扩散模型无监督学习最大熵模型统计量的纹理建模方法,仅用512个统计量即可生成质量优于或媲美当前最优模型(约177k统计量)的纹理图像,并实现平滑插值。

详情
AI中文摘要

视觉纹理——包含重复元素的空间均匀图像区域(例如草地、树皮)——在视觉场景中普遍存在,并为识别和分析材料及物体提供重要线索。许多现有纹理模型从单张纹理图像中提取关键统计量,然后通过匹配这些统计量生成视觉上相似的高质量样本。然而,它们的统计量要么是手工设计的,要么基于为其他目的(如物体识别)预训练的网络。在这里,我们开发了第一个用于无监督学习一组统计量的原理性方法,这些统计量用于约束最大熵概率模型。我们利用为生成扩散模型开发的方法来推导训练和采样程序,并将这些与通过匹配统计量进行采样的传统方法进行比较。尽管我们训练的模型很紧凑(512个统计量),但它生成的纹理图像质量与当前最先进的模型(约177k统计量)相当或更好。通过合成对一个模型不可区分但对另一个模型差异最大的图像,对两个模型进行更直接的比较,揭示了它们的相对优势和劣势。最后,我们表明,与以前的统计纹理模型不同,在我们的模型表示空间中的直线轨迹生成均匀的纹理样本,这些样本在两个端点的特征之间平滑插值。

英文摘要

Visual textures -- spatially homogeneous image regions containing repeated elements (e.g. a field of grass, the bark of a tree) -- are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.
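For readers unfamiliar with maximum entropy models, the standard form is sketched below. The notation is an assumption, not taken from the paper: $\phi_k$ are the learned statistics ($K = 512$ here), $\bar{\phi}_k$ their target values measured on the example texture, and $\theta_k$ the Lagrange multipliers.

```latex
p_{\theta}(\mathbf{x}) = \frac{1}{Z(\theta)} \exp\!\left(-\sum_{k=1}^{K} \theta_k\, \phi_k(\mathbf{x})\right),
\qquad
\mathbb{E}_{p_{\theta}}\!\left[\phi_k(\mathbf{x})\right] = \bar{\phi}_k, \quad k = 1, \dots, K
```

Among all distributions satisfying the $K$ expectation constraints, this exponential-family form has the highest entropy; how the paper parameterises and trains $\phi_k$ with diffusion methods is not specified in the abstract.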

2606.17340 2026-06-17 cs.CV cs.AI 新提交

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

几何一致的内窥镜表示用于图像引导导航:基于结构化基础模型适配

Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

发表机构 * Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Semaphor Surgical Johnson & Johnson MedTech(强生医疗科技)

AI总结 提出统一框架,结合合成数据管道与层级感知几何语义适配,学习几何一致且领域鲁棒的图像表示,提升单目内窥镜中的位姿估计与深度预测性能。

详情
AI中文摘要

由于深度线索有限、组织纹理弱、非刚性变形以及跨域外观变化大,单目内窥镜中基于视觉的精确导航十分困难,这些问题使得位姿估计、深度预测和图像-解剖对齐复杂化。尽管最近的视觉基础模型显示出潜力,但它们学到的表示往往几何一致性不足,阻碍了稳定的特征对应,限制了其在后续导航任务中的可靠性。我们提出了一个统一框架,用于学习单目内窥镜中几何一致且领域鲁棒的图像表示。该框架结合了提供精确几何监督的合成数据管道与层级感知几何语义适配,后者是标准LoRA的结构化替代方案,在Transformer层级间选择性插入低秩适配器,并配合逐层训练目标,以鼓励中间特征的几何对应和深层特征的语义一致性。在公开和专有数据集上的实验表明,几何和语义表示质量得到提升,从而在包括位姿估计和单目深度估计在内的下游导航任务上取得更好性能。学到的表示在临床支气管镜中显示出良好的合成到真实迁移能力,并为在有限监督下适配鼻窦镜和结肠镜提供了有用的初始化。该框架还显示出随模型大小和训练数据的良好扩展性。这些结果支持层级感知、几何引导的适配作为内窥镜表示学习的实用方法。

英文摘要

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.

2606.17339 2026-06-17 cs.AI cs.CL cs.SD 新提交

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx: 面向临床语音AI的多任务基准

Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

发表机构 * University of Toronto(多伦多大学)

AI总结 提出SpeechDx基准,涵盖12个数据集和27个任务,通过语音产生阶段(概念化、公式化、发音)组织任务,评估12种音频编码器,发现大规模语音模型表现最佳,但尚无表示能可靠泛化。

详情
AI中文摘要

语音通过同时涉及神经、运动、呼吸和发声系统,为健康提供了一个独特的窗口。当前的临床语音AI方法主要通过孤立的特定疾病研究取得进展,导致结果难以比较,泛化能力难以评估。我们引入了SpeechDx,这是一个大规模的临床语音AI基准,涵盖12个数据集和27个任务,涉及多种健康状况。为了能够基于共享的临床机制进行评估,SpeechDx根据任务所破坏的语音产生阶段(概念化、公式化和发音)来组织任务。该基准通过包含有限标注数据的任务以及跨多个数据集评估同一健康状况来测试泛化能力,从而区分有临床意义的模式与数据集伪影。我们系统评估了12个最先进的音频编码器在所有任务以及零样本跨条件迁移下的表现。结果表明,大规模语音模型代表了最强的整体基线,领域特定模型仅在紧密匹配的任务上提升性能,而当前没有任何表示能在临床语音领域可靠泛化。SpeechDx建立了一个共享评估框架,用于追踪通用临床语音表示的进展。

英文摘要

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

2606.17334 2026-06-17 cs.CV 新提交

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

FATE: 基于柱状编码和频率感知训练的事件目标检测

Md Tawheedul Islam Bhuian, Kyoung-Don Kang

发表机构 * School of Computing, State University of New York at Binghamton(纽约州立大学宾汉姆顿分校计算学院)

AI总结 提出FATE框架,通过柱状编码保留事件流时间结构,并利用频率感知训练生成密集伪标签,实现高达200Hz的高时间分辨率目标检测,性能优于现有方法。

详情
AI中文摘要

事件相机是生物启发式传感器,异步捕获对数强度变化,在高速和高动态范围场景中具有固有优势。然而,事件流的稀疏和异步特性对现代深度学习架构构成了根本性挑战。为了与标准模型兼容,大多数现有方法将累积窗口划分为固定的时间子区间。虽然这种方法对空间处理有效,但这种内部离散化丢弃了细粒度的时间结构,并将推理限制在训练监督所施加的低时间频率下。为了解决这一限制,我们提出了FATE,一个基于新型柱状编码(PE)的统一框架。在目标频率决定的离散宏观累积窗口上操作时,PE避免了内部时间子区间划分。它将事件组织成空间柱,并通过投影到连续时间正交多项式基上来近似其窗口内演化。这种公式产生了一个L2最优表示,在密集伪图像中保留了丰富的时间动态,减轻了稀疏事件条件下的信息损失。为了充分利用这种表示,我们引入了频率感知训练(FAT),一种软均值教师课程,生成时间密集的伪标签,有效弥合了低频监督和高频推理之间的不匹配。大量实验表明,FATE能够跨架构范式泛化,并持续优于强基线。它能够在高达200Hz的高时间分辨率下实现鲁棒的目标检测,同时参数数量和推理延迟的开销最小。

英文摘要

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.
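The idea of projecting a pillar's events onto a continuous-time orthogonal polynomial basis can be sketched as follows. This is an illustration under assumptions, not the paper's encoder: it assumes Legendre polynomials as the basis (the abstract does not name the family), treats each event as a signed impulse, and uses the formal projection coefficient c_k = (2k + 1) / 2 · Σ_i p_i · P_k(t_i) with timestamps rescaled to [-1, 1].

```python
def legendre(k_max, t):
    """Values P_0(t)..P_k_max(t) via the Bonnet recurrence, for t in [-1, 1]."""
    values = [1.0, t]
    for k in range(1, k_max):
        values.append(((2 * k + 1) * t * values[k] - k * values[k - 1]) / (k + 1))
    return values[: k_max + 1]

def pillar_coefficients(timestamps, polarities, t_start, t_end, k_max=3):
    """Project one pillar's signed event train onto Legendre polynomials.

    Returns k_max + 1 coefficients; stacking them per pixel would give the
    channels of a dense pseudo-image (an assumed layout).
    """
    coeffs = [0.0] * (k_max + 1)
    for ts, pol in zip(timestamps, polarities):
        # Rescale the timestamp from [t_start, t_end] to [-1, 1].
        t = 2.0 * (ts - t_start) / (t_end - t_start) - 1.0
        for k, p_k in enumerate(legendre(k_max, t)):
            coeffs[k] += (2 * k + 1) / 2.0 * pol * p_k
    return coeffs

# Three events in a 5 ms window (one 200 Hz frame), timestamps in milliseconds.
coeffs = pillar_coefficients([0.0, 2.5, 5.0], [1, 1, -1], t_start=0.0, t_end=5.0)
```

No sub-binning is involved: the window is summarised by a handful of coefficients per pillar, and the zeroth coefficient is simply half the net polarity count.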

2606.17331 2026-06-17 cs.LG 新提交

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

不确定性下的决策驱动地质导向:序列决策优化的统一框架

Hibat Errahmen Djecta, Sergey Alyaev, Kristian Fossum, Reidar B. Bratvold, Ressi Bonti Muhammad, Apoorv Srivastava

发表机构 * NORCE Research Centre(NORCE研究机构) University of Stavanger(斯塔万格大学) Stanford University(斯坦福大学)

AI总结 提出一个将粒子滤波与强化学习结合的地质导向框架,通过显式建模地质不确定性并评估三种决策策略,实现稳定且高效的井轨迹实时优化。

详情
AI中文摘要

地质导向需要在未知地质构造中导航井轨迹,同时根据钻井过程中获取的间接测量值顺序更新决策。本文提出一个不确定性感知的地质导向框架,该框架将用于概率性地下解释的粒子滤波与用于序列决策的基于价值的强化学习紧密结合。钻头前方的地质不确定性通过粒子滤波显式表示,从而实现基于信念的控制而非确定性轨迹校正。该框架将粒子滤波信念更新与信念感知决策策略耦合,并评估在相同不确定性表示下运行的三种决策选项:一种可解释的近似动态规划方案、一种深度Q学习基线,以及一种采用目标Q网络方案训练以保持稳定性的双深度强化学习架构,该架构使用对偶(价值/优势)分解进行Q值参数化。除了最终的放置性能外,我们还使用衡量随时间变化的转向平滑度的稳定性指标评估策略行为,从而提供关于决策策略如何随不确定性演变而响应的额外操作洞察。该框架集成了一个API,用于在工业地质导向模拟器中在真实测量噪声和钻井约束下进行验证。通过在所有方法中使用相同的地质实现、操作限制和奖励定义,实验提供了对替代决策策略在整个钻井过程中行为的受控和高保真评估,而不仅仅是根据最终井轨迹评估性能。

英文摘要

Geosteering requires navigating a well trajectory through an unknown geological configuration, while sequentially updating decisions based on indirect measurements acquired during drilling. This work presents an uncertainty-aware geosteering framework that tightly integrates particle filtering for probabilistic subsurface interpretation with value-based reinforcement learning for sequential decision-making. Geological uncertainty ahead of the drill bit is represented explicitly through a particle filter (PF), enabling belief-informed control rather than deterministic trajectory correction. The framework couples PF belief updates with belief-informed decision policies and evaluates three decision-making options that operate under identical uncertainty representations: an interpretable Approximate Dynamic Programming (ADP) scheme, a Deep Q-learning baseline, and a Dual Deep Reinforcement Learning (Dual DRL) architecture trained with a target Q-network scheme for stability, using a dueling (value/advantage) decomposition for Q-value parameterization. Beyond final placement performance, we assess policy behavior using stability-oriented metrics that quantify steering smoothness over time, providing additional operational insight into how decision policies respond as uncertainty evolves. The framework is integrated with an API for validation within an industrial geosteering simulator under realistic measurement noise and drilling constraints. Using identical geological realizations, operational limits, and reward definitions across methods, the experiments provide a controlled and high-fidelity evaluation of how alternative decision policies behave throughout the drilling process, rather than evaluating performance solely from the final well trajectory.
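Two building blocks named in this abstract, the particle-filter belief update and the dueling (value/advantage) decomposition, can be sketched in a few lines. This is a toy illustration with assumed names and a Gaussian measurement likelihood; it is not the authors' implementation, and the three actions are hypothetical.

```python
import math

def update_weights(weights, predicted, observed, noise_std):
    """Particle-filter belief update: reweight by Gaussian likelihood, then normalise."""
    likelihoods = [
        math.exp(-0.5 * ((observed - p) / noise_std) ** 2) for p in predicted
    ]
    unnormalised = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def dueling_q(value, advantages):
    """Dueling decomposition: Q(s, a) = V(s) + A(s, a) - mean over a of A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

# Three particles, each predicting the log reading at the bit under one geological hypothesis.
weights = update_weights([1 / 3, 1 / 3, 1 / 3], predicted=[10.0, 12.0, 15.0],
                         observed=12.0, noise_std=1.0)

# Hypothetical actions: steer up, steer down, hold.
q_values = dueling_q(value=2.0, advantages=[0.5, -0.5, 0.0])
best_action = max(range(len(q_values)), key=q_values.__getitem__)
```

In a belief-informed policy, the value and advantage estimates would be computed from the weighted particle set rather than from a single deterministic subsurface model.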

2606.17328 2026-06-17 cs.AI 新提交

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

MemTrace: 探知长期记忆中最终准确率所遗漏的信息

Xianxuan Long, Zhikai Chen, Shenglai Zeng, Shouren Wang, Kai Guo, Jiliang Tang

发表机构 * Michigan State University(密歇根州立大学) Case Western Reserve University(凯斯西储大学)

AI总结 提出MemTrace基准,以知识点为单位,沿记忆年龄、问题类型和证据条件三个维度评估LLM代理的长期记忆,发现证据使用是主要瓶颈。

详情
AI中文摘要

LLM代理越来越多地在会话之间维护用户事实的长期记忆。然而,这种记忆通常通过聚合问题行或情节的准确率来评估。由于这种方法独立评分问题行,即使多个问题探查同一事实,也无法显示该事实在条件变化时的行为。我们引入MemTrace,一个以知识点为测量单位的基准:知识点是关于用户的单个类型化事实,而非单个问题。MemTrace沿三个受控维度探查每个事实:记忆年龄,由事实出现在历史中的会话次数定义;问题类型,涵盖当前状态、先前状态和变化轨迹;以及证据条件,涵盖存在、缺失和被错误前提反驳的设置。评估跨四个范式的13种记忆系统配置,我们发现相似的汇总准确率隐藏了不同的失败:恢复事实的当前和先前状态并不意味着跟踪其变化,安全弃权并不意味着纠正错误前提。主要瓶颈是证据使用,而非检索:当系统失败时,证据可检索的次数比缺失的次数多10倍。这些结果表明,改进长期记忆需要更好地使用可获取的证据,而不仅仅是增加存储或检索。

英文摘要

LLM agents increasingly maintain long-term memory of user facts across sessions. Yet such memory is usually evaluated by aggregating accuracy over question rows or episodes. Because this approach scores question rows independently, even when several questions probe the same fact, it cannot show how that fact behaves as conditions change. We introduce MemTrace, a benchmark whose unit of measurement is the knowledge point: a single typed fact about the user, rather than an individual question. MemTrace probes each fact along three controlled dimensions: memory age, defined by how many sessions ago the fact appeared in the history; question type, covering current state, earlier state, and trajectory of change; and evidence condition, covering present, missing, and contradicted-by-false-premise settings. Evaluating 13 memory-system configurations across four paradigms, we find that similar pooled accuracy hides different failures: recovering a fact's current and earlier states does not imply tracking how it changed, and safe abstention does not imply correcting a false premise. The dominant bottleneck is evidence use, not retrieval: when systems fail, the evidence was retrievable 10 times more often than it was missing. These results suggest that improving long-term memory requires better use of reachable evidence, not simply more storage or retrieval.
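The difference between pooled accuracy and knowledge-point-level scoring can be made concrete with a toy sketch. The rows below are invented for illustration (not benchmark data), and the function names are assumptions.

```python
from collections import defaultdict

# (knowledge_point, question_type, correct): toy rows, not benchmark data.
rows = [
    ("home_city", "current", True),
    ("home_city", "earlier", True),
    ("home_city", "trajectory", False),
    ("employer", "current", True),
    ("employer", "earlier", True),
    ("employer", "trajectory", True),
]

def pooled_accuracy(rows):
    """Score every question row independently."""
    return sum(correct for _, _, correct in rows) / len(rows)

def per_point_profile(rows):
    """Group results by knowledge point so failures stay attached to a fact."""
    profile = defaultdict(dict)
    for point, qtype, correct in rows:
        profile[point][qtype] = correct
    return dict(profile)

def fully_tracked(rows):
    """Fraction of knowledge points answered correctly under every probe."""
    profile = per_point_profile(rows)
    return sum(all(p.values()) for p in profile.values()) / len(profile)
```

Here the pooled score is 5/6, yet only half of the facts are handled correctly under all three question types, which is the kind of gap a per-fact view exposes.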

2606.17321 2026-06-17 cs.LG cs.CV 新提交

ProCUA-SFT Technical Report

ProCUA-SFT 技术报告

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

发表机构 * NVIDIA(英伟达) University of Washington(华盛顿大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 提出 ProCUA-SFT 数据集,通过自动化管道从 2484 个应用组合的合成轨迹中蒸馏出 310 万步级 SFT 样本,微调 UI-TARS 7B 在 OSWorld 上达到 45.0% 的成功率,比基线提升 18.7 个百分点。

Comments 15 pages, 5 figures

详情
AI中文摘要

训练计算机使用智能体(CUA)——通过截图和键盘/鼠标操作与图形桌面交互的模型——需要在全桌面环境中收集的大规模、多样化的轨迹数据。最大的公共资源 AgentNet(22.5K 条人类轨迹)在用于监督微调(SFT)时会导致负迁移:在 AgentNet 上继续训练 UI-TARS 7B 导致 OSWorld 成功率从 26.3% 下降到 8-10%。我们提出了 ProCUA-SFT,一个包含 310 万步级 SFT 样本的数据集,这些样本从 2484 个应用组合中的 93K 条合成轨迹中蒸馏得到。该数据集由一个全自动管道生成,该管道(i)在带有真实世界内容的实况桌面上合成有基础的任务——912 个来自 SpreadsheetBench 的电子表格、约 10K 个来自 Zenodo10K 的宽松许可演示文稿以及多应用 OSWorld 配置——以及(ii)在展开前通过二元前置条件检查验证每个任务的可行性。单个 VLM(Kimi-K2.5)作为目标生成器、前置条件判断器和轨迹执行器,消除了规划器-执行器的能力差距。每条轨迹被扩展为步前缀样本,精确复现推理时看到的上下文布局。在 ProCUA-SFT 上微调 UI-TARS 7B 一个 epoch 后,在 OSWorld 上达到 45.0%——比基础模型提升 18.7 个百分点,比 AgentNet 训练的模型高出 35 个百分点以上。ProCUA 的一个子集被纳入 Nemotron 3 Nano Omni 模型的训练数据中,为其计算机使用能力做出了贡献。

英文摘要

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing to train UI-TARS 7B on AgentNet causes the OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and more than 35 percentage points above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
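The step-prefix expansion mentioned in this abstract can be sketched as follows. The field names and the exact context layout are assumptions (the abstract does not specify them); the sketch only shows the general pattern of producing one training sample per step, each conditioned on everything that came before it.

```python
def expand_step_prefixes(goal, steps):
    """Turn one trajectory into one SFT sample per step.

    Each sample's context is the goal, all earlier (observation, action)
    pairs, and the current observation; the target is the current action.
    """
    samples = []
    for i, (observation, action) in enumerate(steps):
        samples.append({
            "goal": goal,
            "history": list(steps[:i]),
            "observation": observation,
            "target_action": action,
        })
    return samples

# Hypothetical three-step trajectory with placeholder screenshots and actions.
trajectory = [
    ("screenshot_0", "click(File)"),
    ("screenshot_1", "click(Save As)"),
    ("screenshot_2", "type(report.xlsx)"),
]
samples = expand_step_prefixes("Save the sheet as report.xlsx", trajectory)
```

The reported counts are consistent with this pattern: 3.1M samples from 93K trajectories is roughly 33 samples per trajectory, although the abstract does not state the average trajectory length.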

2606.17312 2026-06-17 cs.AI 新提交

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

通过结构不确定性量化LLM逻辑推理中的一致性

Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

发表机构 * AWS Generative AI Innovation Center(AWS生成式AI创新中心)

AI总结 提出结构不确定性框架,通过自偏好排序的稳定性评估LLM推理一致性,在逻辑和数学任务中与答案分散度互补,提升不可靠实例识别。

Comments Published at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. Accepted as best paper

详情
AI中文摘要

大型语言模型可以通过不稳定、矛盾或难以一致排序的推理路径得出相同答案——这种失败模式在多步演绎推理中尤为普遍。现有方法主要通过输出分散度(衡量采样答案的差异)来评估可靠性,但这丢弃了一个互补信号:模型是否能一致地对竞争性推理候选进行排序。我们提出结构不确定性,一个从自偏好诱导的推理解决方案排序稳定性导出的、具有一致性意识的框架。给定一个查询,我们生成多个候选解决方案,并让模型对其自身输出进行成对偏好判断。我们通过Bradley-Terry模型与PageRank将自偏好聚合成排序分布,并将信号分解为两个基于熵的分量:跨试验排序不稳定性和试验内候选歧义性。在五个LLM和八个基准上,结构信号提供了与答案分散度互补的信息:在逻辑和数学推理任务中,组合提高了不可靠实例的识别,而在事实检索中,结构信号坍缩为均匀分布,诊断出一个推理层面一致性评估无信息性的状态边界。两个分量与准确性的关系不同:试验内歧义性与正确性正相关——与多个合理解决方案路径保持竞争的情况一致——而跨试验不稳定性与正确性负相关,表明推理不可靠。结构不确定性最好不被理解为通用置信度估计器,而是作为逻辑推理一致性的状态敏感评估器。

英文摘要

Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
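The aggregation of pairwise self-preferences into entropy-based signals can be sketched roughly as below. Several simplifications are assumed and should not be read as the paper's method: the PageRank step is omitted, within-trial ambiguity is taken as the entropy of the normalised Bradley-Terry strengths, and across-trial instability as the entropy of which candidate ranks first across trials. The fit also assumes every candidate wins at least one comparison.

```python
import math
from collections import Counter

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from wins[i][j] = times i beat j (MM updates)."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in probs if q > 0)

def structural_signals(trials):
    """trials: list of win matrices, one per repeated self-preference trial."""
    strengths = [bradley_terry(w) for w in trials]
    within = sum(entropy(s) for s in strengths) / len(strengths)
    top = Counter(max(range(len(s)), key=s.__getitem__) for s in strengths)
    across = entropy([c / len(trials) for c in top.values()])
    return within, across

# Toy win matrices over three candidate solutions (invented counts).
trial_a = [[0, 3, 3], [1, 0, 2], [1, 2, 0]]  # candidate 0 usually preferred
trial_c = [[0, 1, 1], [3, 0, 2], [3, 2, 0]]  # candidate 0 usually rejected
within, across = structural_signals([trial_a, trial_a, trial_c])
```

When every trial agrees on the top candidate the across-trial term is zero, and it grows as the preferred candidate changes between trials.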

2606.17310 2026-06-17 cs.CV 新提交

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

SierpinskiCam: Camera-Controlled Video Retaking Based on Sierpinski Triangle Pattern Cues

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

Affiliations * University of Michigan, Ann Arbor; VISTEC, Thailand

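AI summary Proposes SierpinskiCam, which augments geometry guidance with Sierpinski dome texture cues and introduces a reference-video conditioning mechanism, addressing the sparse regions that arise in monocular video retaking when the camera departs far from the source trajectory and improving camera controllability, geometric consistency, and video quality.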

Comments 20 pages, 13 figures

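Details
AI Chinese abstract (translated)

Generating novel renderings of a scene along a user-defined camera trajectory from a single monocular video, known as video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided methods reconstruct a 4D representation from the source video and render it along the target trajectory to condition a video diffusion model. However, this guidance degrades when the target camera departs from the source trajectory, leaving newly exposed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features and remain trackable even under large viewpoint changes. We further introduce a reference-video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.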

English abstract

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.
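The idea of separating appended source-video tokens from target tokens with negative RoPE indices can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how indices are assigned or along which axis, so the 1D layout (target tokens at 0..T-1, source tokens at -S..-1) and the names `position_indices` and `rope_rotate` are hypothetical, and a real video model would likely use a multi-axis variant.

```python
import math


def position_indices(num_target, num_source):
    """Return RoPE positions for [target tokens..., source tokens...].

    Assumed layout: target tokens keep 0..T-1, while the appended source
    tokens take -S..-1, so the two streams never share a position.
    """
    target_pos = list(range(num_target))
    source_pos = list(range(-num_source, 0))
    return target_pos + source_pos


def rope_rotate(vec, pos, base=10000.0):
    """Apply a standard rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)  # rotation angle for this channel pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Because rotary embeddings make attention scores depend only on the difference between positions, a target query at position 2 attending to a source key at position -3 sees a relative offset of 5, which no pair of target tokens in a short sequence would mix up with an in-stream neighbor.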

2606.17309 2026-06-17 cs.RO New submission

Abstention-Aware Personalized Object Rearrangement via Uncertainty-Guided LLM Assistance

Abstention-Aware Personalized Object Rearrangement Based on Uncertainty-Guided LLM Assistance

Sam Collin, Ali Ayub

Affiliations * Concordia University

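AI summary Proposes the APOLLO framework, which combines a lightweight personalized embedding model with selective large language model assistance and uses uncertainty estimates to invoke the LLM only for ambiguous decisions, enabling efficient, privacy-preserving, abstention-aware object rearrangement.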

Comments Accepted at the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

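Details
AI Chinese abstract (translated)

Robotic assistance in household environments requires not only predicting where objects should be placed but also reasoning about when objects should not be placed at all. Existing personalized object rearrangement methods mostly assume clean observations and complete actionability, which limits their applicability in realistic, cluttered, and partially erroneous environments. This paper proposes APOLLO, a hybrid framework for abstention-aware personalized object rearrangement that combines a lightweight personalized embedding model (PEM) with selective large language model (LLM) assistance. The PEM is trained for each user-environment pair from a small number of demonstrations, runs entirely on CPU, and produces uncertainty estimates that are used to invoke LLM-based reasoning only for ambiguous decisions, balancing efficiency, privacy, and reasoning capability. To evaluate this formulation beyond existing benchmarks, we introduce APOR, a synthetic, LLM-generated dataset that captures room-level, multi-furniture environments, diverse organizational profiles, explicit abstention behavior, and noisy partial scene context. Extensive experiments on PARSEC and APOR provide initial evidence that APOLLO outperforms prior LLM-based baselines in controlled benchmark settings while substantially reducing LLM usage. Code is available at https://github.com/PaInt-Lab/APOLLO.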

English abstract

Robotic assistance in household environments requires not only predicting where objects should be placed, but also reasoning about when objects should not be placed at all. Existing approaches to personalized object rearrangement primarily focus on placement decisions under the assumption of clean observations and complete actionability, limiting their applicability in realistic, cluttered, and partially erroneous settings. In this paper, we introduce APOLLO, a hybrid framework for abstention-aware personalized object rearrangement that combines a lightweight, personalized embedding model (PEM) with selective large language model (LLM) assistance. PEM is trained for each user-environment pair using a small number of demonstrations, operates entirely on CPU, and produces uncertainty estimates, which are used to selectively invoke LLM-based reasoning only for ambiguous decisions, balancing efficiency, privacy, and reasoning capability. To evaluate this formulation beyond existing benchmarks, we introduce APOR, a synthetic, LLM-generated dataset that captures room-level, multi-furniture environments, diverse organizational profiles, explicit abstention behavior, and noisy partial scene context. Extensive experiments on both PARSEC and APOR provide initial evidence that APOLLO improves over prior LLM-based baselines in controlled benchmark settings while substantially reducing LLM usage. Code is available at https://github.com/PaInt-Lab/APOLLO.
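The selective invocation described above, where a lightweight model answers on its own unless its uncertainty is high, can be sketched roughly as below. This is not the APOLLO code: the abstract does not specify the uncertainty measure or threshold, so normalized entropy, the value `0.6`, and the names `decide`, `llm_fallback`, and `ABSTAIN` are all assumptions made for illustration.

```python
import math

ABSTAIN = "abstain"  # hypothetical label for "do not place this object"


def normalized_entropy(probs):
    """Entropy of a {option: probability} dict, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return h / math.log(len(probs))


def decide(pem_probs, llm_fallback, threshold=0.6):
    """Pick a placement, deferring to the LLM only for ambiguous cases.

    pem_probs: probabilities from the lightweight model over placement
    options, with ABSTAIN included as one of the options.
    llm_fallback: callable taking the option list and returning one option
    (a stand-in for an LLM call, not a real API).
    Returns (decision, llm_was_used).
    """
    if normalized_entropy(pem_probs) <= threshold:
        return max(pem_probs, key=pem_probs.get), False
    return llm_fallback(sorted(pem_probs)), True
```

For example, `decide({"shelf": 0.9, "drawer": 0.05, ABSTAIN: 0.05}, llm_fallback)` stays on the lightweight path, whereas a near-uniform distribution crosses the threshold and triggers the fallback.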