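arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.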

2605.20079 2026-05-20 cs.CV cs.AI cs.LG eess.IV

Vessel Segmentation in X-ray Cardiac Angiography Based on Machine Learning and Region Growing

E O Rodrigues, L O Rodrigues, J J Lima, D Casanova, F Favarim, E R Dosciatti, V Pegorini, L S N Oliveira, F F C Morais

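AI summary: This paper proposes a pixel-classification method for segmenting vessels in X-ray images. It combines texture features with region growing and uses a random forest classifier, reporting 95.48% accuracy.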

2605.20072 2026-05-20 cs.AI cs.RO

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oussama Zenkri, Oliver Brock

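AI summary: This paper studies how embodied LLM agents behave under different observation inputs and finds that higher-fidelity observations reduce problem-solving performance. The core method is to vary the information available to the agent and measure the resulting changes in behavior. The main contribution is to show how perceptual errors and reasoning failures interact.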

Comments Submitted to From Animals to Animats: The 18th International Conference on the Simulation of Adaptive Behavior (SAB)

English abstract

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.
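
The noise probe described in the abstract (randomly flipping perceived action outcomes) can be pictured with a minimal sketch. It assumes action outcomes are booleans (succeeded or failed); the function names `flip_outcome` and `observed_flip_rate` are hypothetical and this is not the authors' simulator.

```python
import random

def flip_outcome(true_outcome: bool, flip_prob: float, rng: random.Random) -> bool:
    """Return the outcome the agent perceives: inverted with probability flip_prob."""
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must be in [0, 1]")
    if rng.random() < flip_prob:
        return not true_outcome  # perception is corrupted on this step
    return true_outcome

def observed_flip_rate(flip_prob: float, trials: int, seed: int = 0) -> float:
    """Empirical fraction of successful actions that are perceived as failures."""
    rng = random.Random(seed)
    flipped = sum(1 for _ in range(trials) if not flip_outcome(True, flip_prob, rng))
    return flipped / trials
```

With `flip_prob=0.4`, the setting at which the abstract reports peak performance, roughly four in ten perceived outcomes disagree with what actually happened.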

2605.20066 2026-05-20 cs.CL

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

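AI summary: This paper studies reinforcement-learning-based zero-shot Text-to-SPARQL generation in the scholarly domain. It trains a small instruction-tuned language model with GRPO on DBLP-QuAD and compares it against a supervised DoRA-finetuned baseline.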

Comments Accepted by NeSy 2026

English abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.

2605.20064 2026-05-20 cs.CV

Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

Guilherme Santos da Silva, Dalcimar Casanova, Jefferson Tales Oliva, Erick Oliveira Rodrigues

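AI summary: This study proposes a new deep-learning method that uses the pix2pix network to automatically segment and quantify cardiac fat. It achieves high-accuracy segmentation of epicardial and mediastinal fat and outperforms existing methods in f1-score and run time.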

DOI: 10.1016/j.medengphy.2024.104104
Journal ref: Medical Engineering & Physics 2024

English abstract

In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a generative conditional adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, despite not being originally tailored for this purpose. The two types of fat deposits of interest in this study are referred to as epicardial and mediastinal fats, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and f1-score 98.73 for the segmentation of the epicardial fat and 97.90% of accuracy and f1-score of 98.40 for the mediastinal fat. These findings represent the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of f1-score and run time, enabling the images to be segmented in real time.

2605.20061 2026-05-20 cs.CL

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

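AI summary: This paper proposes ReBel, an algorithm that models structured belief states to guide policy learning, addressing the credit assignment problem that partial observability causes in long-horizon tasks. Experiments on benchmarks including ALFWorld and WebShop show higher task success rates and better sample efficiency.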

Comments 10 pages, 4 figures, 3 tables, plus appendix

English abstract

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
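
The idea of belief-aware grouping can be illustrated with a small sketch of group-relative advantages. The abstract does not say how belief states are compared or how advantages are normalized, so this sketch makes two assumptions: beliefs are reduced to a hashable key and grouped by exact match, and rewards are normalized with the GRPO-style mean and standard deviation of each group. The function name `grouped_advantages` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def grouped_advantages(samples, eps=1e-8):
    """Compute per-sample advantages relative to samples sharing a belief key.

    samples: list of (belief_key, reward) pairs.
    Returns advantages in the same order as the input.
    """
    groups = defaultdict(list)
    for key, reward in samples:
        groups[key].append(reward)
    # Per-group baseline (mean) and scale (population standard deviation).
    stats = {key: (mean(rs), pstdev(rs)) for key, rs in groups.items()}
    return [(reward - stats[key][0]) / (stats[key][1] + eps) for key, reward in samples]
```

Comparing a trajectory only against others that share its belief key keeps the baseline specific to that situation, which is the intuition behind the lower-variance estimates the abstract mentions.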

2605.20050 2026-05-20 cs.CL

Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

Calvin Yixiang Cheng, Dorian Quelle, Scott A. Hale

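AI summary: This study examines how language mutations affect the persistent diffusion of conspiracy theories on social media. Analyzing three years of conspiracy-related posts from X, it finds that conspiracy claims with greater semantic mutation have longer lifespans, and that mutations in psycholinguistic properties are associated with extended lifespans.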

English abstract

This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, risk- and health- related vocabularies, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns: simplification and assimilation, at both linguistic and AAT structural levels. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed lights on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims that can address their potential variations.

2605.20044 2026-05-20 cs.CV

OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

Guiyu Liu, Niklas Vaara, Janne Mustaniemi, Juho Kannala, Janne Heikkilä

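AI summary: OP2GS introduces a dual-opacity mechanism that augments each primitive with an explicit instance identity and a dedicated instance opacity σ*, addressing the lack of object-level identity in 3D Gaussian Splatting and supporting open-vocabulary scene understanding.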

Comments Under review

English abstract

3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $σ^{*}$ for object-mask rendering. The original opacity $σ$ remains responsible for visual reconstruction, while $σ^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.
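
The decoupling of visual existence from instance occupancy can be pictured along a single ray with standard front-to-back alpha compositing. This is a sketch under stated assumptions, not the paper's renderer: each Gaussian is reduced to a scalar color and per-ray opacities, and the mask branch computes its transmittance from σ* (the abstract does not say whether σ or σ* is used there). The names `composite` and `render_ray` are hypothetical.

```python
def composite(values, alphas):
    """Front-to-back alpha compositing of scalar values along one ray."""
    out, transmittance = 0.0, 1.0
    for value, alpha in zip(values, alphas):
        out += transmittance * alpha * value
        transmittance *= (1.0 - alpha)
    return out

def render_ray(colors, sigma, sigma_star, instance_ids, target_id):
    """Render a color with sigma and an object mask with sigma_star.

    Assumption: the mask branch composites a 0/1 membership indicator
    for target_id using sigma_star as the opacity.
    """
    color = composite(colors, sigma)
    membership = [1.0 if i == target_id else 0.0 for i in instance_ids]
    mask = composite(membership, sigma_star)
    return color, mask
```

For example, a front Gaussian with `sigma=0.5` and `sigma_star=0.0` still contributes to the rendered color but is fully transparent in the mask branch, so it cannot occlude or contaminate the mask of the object behind it.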

2605.20040 2026-05-20 cs.LG

Active Context Selection Improves Simple Regret in Contextual Bandits

Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash

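AI summary: This paper studies the contextual multi-armed bandit problem with a finite context space. It optimizes simple regret by actively choosing which context to sample from, and proposes an approach that improves performance whether the context distribution is known or unknown.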

English abstract

We study the contextual multi-armed bandit problem with a finite context space (a.k.a. subpopulations), where the learner recommends a best action for each context and is evaluated by context-weighted simple regret. Our guarantees are worst-case over the reward distributions, while remaining instance-dependent with respect to the context distribution vector $p$. Akin to experimental design problems where the population of interest is fixed but the sampled subpopulation can be controlled, we allow the learner to actively choose which context to sample from. For a known $p$, we characterize tight regret rates: passive sampling where contexts are randomly revealed achieves regret of order $\sqrt{n/T \, \lVert p \rVert_{1/2}}$, whereas active sampling with allocation $q_j \propto p_j^{2/3}$ achieves the tight rate $\sqrt{n/T} \, \lVert p \rVert_{2/3}$. The resulting improvement can be as large as $Θ(k^{1/4})$, where $k$ is the number of contexts. We further extend the analysis to budgeted active sampling, characterize the corresponding tight rate, and identify when a limited active budget suffices to recover the fully active rate. When $p$ is unknown, we propose the Explore-Explore-Then-Commit (EETC) algorithm, which optimally balances estimating the context distribution and the time to switch to active allocation, such that for large horizons, it matches the known-$p$ active rate up to constants. Experiments on synthetic and real-world data support our theoretical findings.
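
The two rates and the allocation rule in the abstract can be evaluated numerically for a given context distribution. The sketch below takes $\lVert p \rVert_\alpha = (\sum_j p_j^\alpha)^{1/\alpha}$ as the quasi-norm (an assumption about notation), ignores constant factors, and treats `n` and `T` simply as the quantities appearing in the stated formulas, since the abstract does not define them. All function names are hypothetical.

```python
import math

def quasi_norm(p, alpha):
    """(sum_j p_j^alpha)^(1/alpha); a quasi-norm when alpha < 1."""
    return sum(x ** alpha for x in p) ** (1.0 / alpha)

def active_allocation(p):
    """Sampling proportions q_j proportional to p_j^(2/3), normalized to sum to 1."""
    weights = [x ** (2.0 / 3.0) for x in p]
    total = sum(weights)
    return [w / total for w in weights]

def passive_rate(p, n, T):
    """sqrt(n/T * ||p||_{1/2}), the stated passive-sampling rate (constants dropped)."""
    return math.sqrt(n / T * quasi_norm(p, 0.5))

def active_rate(p, n, T):
    """sqrt(n/T) * ||p||_{2/3}, the stated active-sampling rate (constants dropped)."""
    return math.sqrt(n / T) * quasi_norm(p, 2.0 / 3.0)
```

For a uniform `p` the two rates coincide, so active sampling only helps when the context distribution is skewed; the allocation then shifts samples toward rare contexts relative to their probability mass.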

2605.20037 2026-05-20 cs.LG cs.AI

Training-Free Bayesian Filtering with Generative Simulators

Thomas Savary, François Rozet, Gilles Louppe

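AI summary: This paper proposes a variant of the optimal particle filter that requires no additional training and uses a diffusion-based dynamics simulator, addressing the scalability of particle filtering in high-dimensional settings. Experiments on nonlinear chaotic systems validate its effectiveness.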

Comments Accepted as a spotlight paper at the International Conference on Machine Learning 2026

2605.20022 2026-05-20 cs.CL

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen

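AI summary: This paper proposes the CogOmniControl framework, which factorizes controllable video generation into two stages, creative intent cognition and generation. A specially trained CogVLM produces more professional and clear outputs, control across different conditions is aligned via reinforcement learning, and the framework surpasses existing open-source models on two benchmarks.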

English abstract

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models, either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce the videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and tuning these cues into dense reasoning output. Besides, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we release its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflows data that carry genuine creative intent rather than simulated ones. Experiments on two benchmarks show that CogOmniControl surpassed the existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/

2605.19990 2026-05-20 cs.RO cs.CV cs.LG

Minimalist Visual Inertial Odometry

Francesco Pasti, Jeremy Klotz, Nicola Bellotto, Shree K. Nayar

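AI summary: This paper proposes a minimalist approach to planar odometry that achieves robust motion estimation for differential-drive robots using four visual measurements and an IMU, showing that minimalist sensing enables efficient and accurate planar odometry.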

Comments This work has been submitted to the IEEE for possible publication

English abstract

Visual-Inertial Odometry(VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded simulator. The resulting model decodes speed from just the four measurements produced by the photodiodes. Pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory. We validate our approach with a prototype sensor mounted on a differential drive robot. Across diverse indoor and outdoor terrains, our system closely tracks the reference ground truth without any real-world fine-tuning. Our work shows that minimalist sensing enables efficient and accurate planar odometry.
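
Pairing a forward speed with an angular speed to obtain a planar trajectory can be sketched as dead reckoning on a unicycle model. The forward-Euler integration, the fixed time step `dt`, and the function name `integrate_planar` are assumptions for illustration; the abstract does not state the integration scheme the authors use.

```python
import math

def integrate_planar(speeds, yaw_rates, dt, pose=(0.0, 0.0, 0.0)):
    """Integrate forward speed (m/s) and yaw rate (rad/s) into planar poses.

    Returns a list of (x, y, theta) tuples, starting with the initial pose.
    """
    x, y, theta = pose
    trajectory = [(x, y, theta)]
    for v, omega in zip(speeds, yaw_rates):
        # Advance along the current heading, then update the heading.
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += omega * dt
        trajectory.append((x, y, theta))
    return trajectory
```

In this picture the speed would come from the photodiode decoder and the yaw rate from the IMU gyroscope; errors in either accumulate over time, as in any dead-reckoning scheme.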

2605.19986 2026-05-20 cs.RO cs.CV cs.LG

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

He-Yang Xu, Pengyuan Zhang, Zongyuan Ge, Xiaoshuai Hao, Serge Belongie, Xin Geng, Yuxin Peng, Xiu-Shen Wei

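AI summary: This paper proposes the MetaFine framework, which decomposes evaluation into three dimensions (understanding, perception, and controlled behavior) to diagnose capability bottlenecks in fine-grained manipulation. Through causal intervention it identifies a key limitation of visual encoders in preserving local spatial structure, with the aim of improving manipulation precision.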

Comments Project page: https://metafine.github.io/

RECIPE: Procedure Planning via Grounding in Instructional Videos

Luigi Seminara, Antonino Furnari, Lorenzo Torresani

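AI summary: This study proposes RECIPE, which uses grounding information from instructional videos to improve procedure planning. Precomputed text embeddings make verification against large-scale video data feasible, improving planning accuracy and robustness.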

English abstract

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.

2605.19975 2026-05-20 cs.LG cs.AI

Learning with Foresight: Enhancing Neural Routing Policy via Multi-Node Lookahead Prediction

Xia Jiang, Yaoxin Wu, Yew-Soon Ong, Yingqian Zhang

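AI summary: This study proposes Multi-node Lookahead Prediction (MnLP), which extends the supervised learning paradigm to predict multiple future nodes simultaneously, strengthening the long-horizon planning of neural routing policies and improving generalization across problem sizes and real-world benchmarks.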

Comments Accepted by the 35th International Joint Conference on Artificial Intelligence

English abstract

Neural policies have shown promise in solving vehicle routing problems due to their reduced reliance on handcrafted heuristics. However, current training paradigms suffer from a fundamental limitation: they primarily focus on next-node prediction for solution construction, resulting in myopic decision-making that undermines long-horizon planning capacity. To this end, we introduce Multi-node Lookahead Prediction (MnLP), a novel training strategy that extends the supervised learning paradigm to predict multiple future nodes simultaneously. We incorporate causal and discardable MnLP modules that operate exclusively during training, facilitating models to anticipate multi-step decisions while preserving inference-time efficiency. By incorporating multi-depth auxiliary supervision into the loss function, MnLP equips neural policies with the ability of long-range contextual understanding. Experimentally, MnLP outperforms existing training methods, improving the generalization capability of neural policies across various problem sizes, distributions, and real-world benchmarks. Moreover, MnLP can be seamlessly integrated into diverse neural architectures without introducing additional inference overhead.
