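arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.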

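2606.17403 2026-06-17 cs.CV cs.AI New submission

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

Affiliations: Applied Intuition; UCLA; UC Berkeley

AI summary: Proposes an end-to-end driving method that needs no expert demonstrations. A policy is pretrained by self-play in a vectorized simulator and then aligned with a pretrained vision backbone, which lowers data cost while matching or exceeding existing methods.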

Abstract

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
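The action-KL part of the alignment objective described above can be sketched roughly as follows. The self-play policy acts as the teacher and the vision-based policy as the student. The discrete action space, the KL direction (teacher to student), and every function name here are assumptions made for illustration; the abstract does not specify them, and the batch-relational low-rank structural loss is not covered.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_kl(teacher_logits, student_logits):
    """KL(teacher || student) over a discrete action set.

    teacher_logits: action logits from the self-play policy (sees scene state).
    student_logits: action logits from the vision-based policy (sees the image).
    Both names and the discrete action space are illustrative assumptions.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)

def batch_action_kl(teacher_batch, student_batch):
    """Mean KL over a batch of paired (image, scene-state) frames."""
    kls = [action_kl(t, s) for t, s in zip(teacher_batch, student_batch)]
    return sum(kls) / len(kls)
```

Because the target distribution comes from the teacher policy rather than from a logged trajectory, a loss of this shape needs only paired frames, which matches the abstract's point about not requiring expert demonstrations.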

2606.17385 2026-06-17 cs.RO New submission

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

Gaotian Wang, Kejia Ren, Andrew Morgan, Yiting Chen, Howard H. Qian, Podshara Chanrungmaneekul, Kaiyu Hang

Affiliations: Rice University; Robotics and AI Institute

AI summary: Presents EgoInfinity, an engine that automatically generates 4D hand-object interaction data from internet videos, enabling motion retargeting across robot embodiments and skill learning without manual annotation.

Comments 24 pages. Project page: https://huggingface.co/spaces/Rice-RobotPI-Lab/EgoInfinity

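AI abstract (translated from Chinese)

Internet videos form the largest reservoir of embodied human manipulation knowledge, yet turning arbitrary RGB videos into actionable robot training data remains a major bottleneck. Existing datasets collected in labs or factories are limited in scale and diversity, which constrains open-world robot learning. Rather than proposing a static dataset, we introduce EgoInfinity, a general 4D hand-object interaction data engine that can generate web-scale data for robot retargeting and learning. EgoInfinity is a modular engine that integrates perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine keep benefiting from advances in any integrated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-related states. Instead of simply chaining independent components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability and reduce the drift and contact inconsistencies common in vision-only reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motion into executable joint trajectories for different robot embodiments, enabling video-to-action retargeting for any robot from any viewpoint and shot size (for example, when the human body is only partially visible). We validate EgoInfinity on perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (for example, grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behaviors for open-world robot learning.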

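Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

Affiliations: University of Maryland, Baltimore County

AI summary: Proposes a multimodal contactless respiratory-rate monitoring framework for heterogeneous mobile robots. Adaptive sensor selection, keypoint-guided ROI extraction, and signal-quality filtering give robust monitoring across platforms and lighting conditions without platform-specific tuning.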

Comments 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

Abstract

Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
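To make the idea of SQI-gated respiratory estimation more concrete, here is a minimal sketch that takes a one-dimensional chest-ROI signal, finds the dominant frequency in a plausible breathing band, and rejects the estimate when a simple quality score is too low. The band limits, the SQI definition (peak power divided by total scanned power), the threshold, and all function names are hypothetical choices for illustration, not values or methods taken from the paper.

```python
import math

def band_power(signal, fs, freq):
    """Power of `signal` at `freq` (Hz) via a single-bin discrete Fourier sum."""
    re = sum(x * math.cos(2 * math.pi * freq * n / fs) for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * n / fs) for n, x in enumerate(signal))
    return re * re + im * im

def estimate_rr(signal, fs, f_lo=0.1, f_hi=0.7, step=0.01, min_sqi=0.2):
    """Return (breaths_per_minute, sqi), or (None, sqi) if quality is too low.

    The band limits, the SQI definition (peak power / total scanned power),
    and the threshold are hypothetical choices, not values from the paper.
    """
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]  # remove the DC offset
    n_steps = int(round((f_hi - f_lo) / step))
    freqs = [f_lo + i * step for i in range(n_steps + 1)]
    powers = [band_power(centered, fs, f) for f in freqs]
    total = sum(powers)
    if total == 0.0:
        return None, 0.0  # flat signal: nothing to estimate
    best = max(range(len(freqs)), key=lambda i: powers[i])
    sqi = powers[best] / total
    if sqi < min_sqi:
        return None, sqi
    return freqs[best] * 60.0, sqi
```

A clean 0.25 Hz sinusoid sampled at 10 Hz for 60 seconds should come out close to 15 breaths per minute, while a flat signal is rejected outright.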

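2606.17368 2026-06-17 cs.AI cs.NI New submission

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang

Affiliations: College of Electronics and Information Engineering, Shenzhen University

AI summary: Proposes a distributed general-purpose agent network architecture in which a protocol adaptation layer connects upper-level task semantics with lower-level network operations. It addresses three core problems (semantic announcement propagation, trusted identity with multi-topic reputation, and semantic-gradient mechanism design) to enable open, trustworthy agent collaboration.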

Abstract

Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single agent remains constrained by its local data, tool permissions, runtime environment, and governance boundary. This paper studies distributed general-purpose agent networks: open peer-to-peer networks in which heterogeneous agents deployed on personal devices, edge nodes, or autonomous computing environments can discover one another, establish trust, negotiate cooperation rules, and execute open-ended tasks. We argue that such networks cannot be obtained by simply combining existing peer-to-peer overlays with conventional multi-agent systems. Unlike traditional P2P networks, agent networks must propagate semantic declarations about intentions, capabilities, states, and cooperation constraints. We therefore propose a layered architecture centered on a protocol adaptation layer that connects upper-level task semantics with lower-level network operations. Based on this architecture, the paper identifies three core mechanism problems: semantic announcement propagation for collaborator discovery, verifiable identity and multi-topic reputation for cooperation governance, and semantic-gradient mechanism design for open task execution. For each problem, we present a technical route, including bodyless gossip with sequential logs, BAID-based identity binding with MG-EigenTrust reputation, and a Stackelberg-style mechanism-generation loop driven by semantic attribution feedback. We further report prototype overhead results for BAID-style tiered verification and mechanism-level simulations of MG-EigenTrust under cross-topic disguise-collusion attacks. The resulting framework provides a system-level foundation for open, trustworthy, and scalable agent collaboration.
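The abstract names MG-EigenTrust but does not define it. As background only, the sketch below shows the classic single-topic EigenTrust power iteration, t <- (1 - alpha) * C^T t + alpha * p, where C is the row-normalized local trust matrix and p is a distribution over pre-trusted peers. The multi-topic extension, the parameter values, and the function names are not from the paper and are assumptions here.

```python
def normalize_rows(local_trust):
    """Row-normalize non-negative local trust scores so each row sums to 1.

    A peer with no outgoing trust falls back to a uniform row.
    """
    n = len(local_trust)
    rows = []
    for row in local_trust:
        clipped = [max(v, 0.0) for v in row]
        total = sum(clipped)
        rows.append([v / total for v in clipped] if total > 0 else [1.0 / n] * n)
    return rows

def eigentrust(local_trust, pretrusted, alpha=0.15, tol=1e-10, max_iter=1000):
    """Classic EigenTrust: iterate t <- (1 - alpha) * C^T t + alpha * p."""
    c = normalize_rows(local_trust)
    n = len(c)
    t = list(pretrusted)
    for _ in range(max_iter):
        nxt = [
            (1 - alpha) * sum(c[i][j] * t[i] for i in range(n)) + alpha * pretrusted[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, t)) < tol:
            return nxt
        t = nxt
    return t
```

In this formulation a peer that nobody vouches for and that is not pre-trusted ends up with zero global trust, which is the property that collusion attacks try to circumvent.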

2606.17362 2026-06-17 cs.CV cs.AI cs.LG cs.RO New submission

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

Affiliations: NVIDIA

AI summary: Presents DriveJudge, which combines rule-based evaluation with VLM reasoning and selectively invokes physics-based rule functions for interpretable, context-aware driving evaluation. It outperforms existing methods on driving quality classification and trajectory preference selection.

Comments Under Review

Abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17355 2026-06-17 cs.CV New submission

Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

Sharva Gogawale, Iddo Hakim, Gal Grudka, Mohammad Suliman, Omer Ventura, Daria Vasyutinsky-Shapira, Berat Kurar-Barakat, Nachum Dershowitz

Affiliations: School of Computer Science and AI, Tel Aviv University

AI summary: For low-resource complex layout classification, proposes a CNN-based classifier trained with layout-preserving augmentations such as narrow anisotropic Gaussian masking and reflection-induced label transformations, substantially improving classification when annotations are scarce.

Abstract

Many digitized corpora suffer from low resources because annotations may be scarce, page scans are noisy and of poor resolution, or layouts are structurally complex in ways that negatively affect the quality of automatic transcription. Developing robust classification models for low-resource languages is inhibited by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To this end, we have curated a complex-layout dataset, manually classified into eight distinct layout types based on their separator regions. To overcome data scarcity, we propose a novel training strategy in the form of a CNN-based classifier that employs strong, domain-aware augmentations to improve generalization. We utilize narrow anisotropic Gaussian masking to suppress incidental textual details while preserving essential separations, compelling the model to learn global geometric arrangements. Additionally, we implement reflection-induced label transformations to enrich the training distribution while maintaining label consistency across asymmetric categories. The results demonstrate that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.
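The reflection-induced label transformation mentioned above can be illustrated with a small sketch: mirroring a page left to right leaves symmetric layout classes unchanged but swaps each asymmetric class with its mirror counterpart, so the label has to be remapped along with the pixels. The class names below are hypothetical, since the abstract does not name the eight layout types, and the image is represented as a plain list of pixel rows to keep the example dependency-free.

```python
# Hypothetical layout classes; the paper's eight types are not named in the abstract.
MIRROR_LABEL = {
    "single_column": "single_column",      # symmetric: unchanged by a flip
    "two_columns": "two_columns",          # symmetric
    "margin_note_left": "margin_note_right",
    "margin_note_right": "margin_note_left",
}

def hflip(image):
    """Mirror a page image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def reflect_sample(image, label):
    """Return the mirrored image together with the label it now deserves."""
    return hflip(image), MIRROR_LABEL[label]

def augment(dataset):
    """Append a mirrored copy of every (image, label) pair."""
    return dataset + [reflect_sample(img, lab) for img, lab in dataset]
```

Flipping without remapping would silently inject mislabeled examples for the asymmetric classes, which is presumably why the label transformation is treated as part of the augmentation.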

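2606.17354 2026-06-17 cs.CL cs.AI New submission

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

Jacob Bremerman, Brihi Joshi, Hirona Arai, Xiang Ren, Jonathan May

Affiliations: University of Southern California Information Sciences Institute

AI summary: Proposes a structured ontology of untranslatability and a taxonomy of compensation strategies, builds a multilingual dataset, and finds through a human preference study that annotation-style compensation strategies are preferred most, laying groundwork for strategy-aware machine translation.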

ProCUA-SFT Technical Report

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

Affiliations: NVIDIA; University of Washington; Allen Institute for AI

AI summary: Presents the ProCUA-SFT dataset. An automated pipeline distills 3.1M step-level SFT samples from synthetic trajectories spanning 2,484 application combinations; fine-tuning UI-TARS 7B on it reaches a 45.0% success rate on OSWorld, 18.7 percentage points above the baseline.

Comments 15 pages, 5 figures

Abstract

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
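The step-prefix expansion described above can be sketched as follows: a trajectory of N steps becomes N training samples, where sample k carries the goal, the first k steps as history, the current observation, and the action to predict. The field names and the exact context layout are assumptions for illustration; the abstract only states that the layout mirrors what the model sees at inference time. If every step yields one sample, 3.1M samples from 93K trajectories works out to roughly 33 steps per trajectory on average.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    observation: str  # e.g. a screenshot reference
    action: str       # e.g. a keyboard/mouse action string

@dataclass
class SFTSample:
    goal: str
    history: List[Tuple[str, str]]  # (observation, action) pairs before this step
    observation: str                # what the model sees now
    target_action: str              # what the model should output

def expand_to_step_prefixes(goal: str, steps: List[Step]) -> List[SFTSample]:
    """Turn one trajectory of N steps into N step-level training samples."""
    samples = []
    for k, step in enumerate(steps):
        history = [(s.observation, s.action) for s in steps[:k]]
        samples.append(SFTSample(goal, history, step.observation, step.action))
    return samples
```

Training on prefixes rather than whole trajectories means each sample looks like a single inference call, so the model never sees future steps in its context.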

2606.17312 2026-06-17 cs.AI New submission

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

Affiliations: AWS Generative AI Innovation Center

AI summary: Proposes a structural uncertainty framework that assesses LLM reasoning consistency through the stability of self-preference rankings. On logic and math tasks it complements answer dispersion and improves identification of unreliable instances.

Comments Published at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. Accepted as best paper

Abstract

Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing methods assess reliability primarily through output dispersion -- measuring how much sampled answers differ -- but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness -- consistent with settings where multiple plausible solution paths remain competitive -- while across-trial instability correlates negatively, signaling unreliable reasoning. Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency.
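One way to picture the two entropy-based components is the sketch below. Each judging trial supplies a matrix of pairwise self-preference counts, Bradley-Terry strengths are fitted with the standard MM update, and two entropies are computed. The specific definitions used here (within-trial ambiguity as the mean entropy of the normalized strengths, across-trial instability as the entropy of which candidate ranks first across trials), the smoothing prior, and the omission of the PageRank step are all assumptions for illustration, not the paper's formulation.

```python
import math

def bradley_terry(wins, iters=200, prior=0.5):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] = number of times candidate i was preferred over candidate j.
    `prior` adds pseudo-wins to every ordered pair so strengths stay positive.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    w = [[(wins[i][j] + prior) if i != j else 0.0 for j in range(n)] for i in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(w[i])
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j]) for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        s = sum(new_p)
        p = [v / s for v in new_p]
    return p

def entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(v * math.log(v) for v in dist if v > 0.0)

def structural_components(trials):
    """trials: list of win matrices, one per repeated judging trial.

    Returns (across_trial_instability, within_trial_ambiguity) under the
    illustrative definitions described in the lead-in.
    """
    strengths = [bradley_terry(w) for w in trials]
    n = len(strengths[0])
    within = sum(entropy(p) for p in strengths) / len(strengths)
    top_counts = [0] * n
    for p in strengths:
        top_counts[max(range(n), key=lambda i: p[i])] += 1
    top_dist = [c / len(strengths) for c in top_counts]
    across = entropy(top_dist)
    return across, within
```

Under these definitions, trials that always crown the same candidate give zero across-trial instability even if the winning margin is narrow, which is the kind of separation between the two signals that the abstract describes.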

2606.17310 2026-06-17 cs.CV New submission

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

Affiliations: University of Michigan, Ann Arbor; VISTEC, Thailand

AI summary: Presents SierpinskiCam, which strengthens geometric guidance with Sierpinski dome texture cues and adds a reference-video conditioning mechanism. This addresses sparse regions when the camera deviates far from the source trajectory in monocular video retaking and improves camera controllability, geometric consistency, and video quality.

Comments 20 pages, 13 figures

Abstract

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.
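For readers unfamiliar with the pattern itself, the sketch below generates a discrete Sierpinski triangle using the well-known bitwise rule that cell (x, y) is filled exactly when x AND y equals zero. It only illustrates the self-similar structure, which repeats at every scale. How the paper maps such a pattern onto a dome and renders it as guidance is not described in the abstract, and nothing here should be read as the authors' construction.

```python
def sierpinski_mask(size):
    """Return a size x size grid of 0/1 where cell (x, y) is 1 iff x & y == 0.

    For size = 2**k this is the classic discrete Sierpinski triangle.
    """
    return [[1 if (x & y) == 0 else 0 for x in range(size)] for y in range(size)]

def render(mask):
    """Draw the mask as text, using '#' for filled cells and '.' for empty ones."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)

if __name__ == "__main__":
    print(render(sierpinski_mask(16)))
```

A grid of side 2**k contains 3**k filled cells, reflecting the three-way self-similarity of the triangle.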

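2606.17309 2026-06-17 cs.RO New submission

Abstention-Aware Personalized Object Rearrangement via Uncertainty-Guided LLM Assistance

Sam Collin, Ali Ayub

Affiliations: Concordia University

AI summary: Presents APOLLO, a framework that combines a lightweight personalized embedding model with selective LLM assistance. Uncertainty estimates trigger the LLM only for ambiguous decisions, giving efficient, privacy-preserving, abstention-aware object rearrangement.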

Comments Accepted at the 2026 IEEE 35th International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

Robotic assistance in household environments requires not only predicting where objects should be placed, but also reasoning about when objects should not be placed at all. Existing approaches to personalized object rearrangement primarily focus on placement decisions under the assumption of clean observations and complete actionability, limiting their applicability in realistic, cluttered, and partially erroneous settings. In this paper, we introduce APOLLO, a hybrid framework for abstention-aware personalized object rearrangement that combines a lightweight, personalized embedding model (PEM) with selective large language model (LLM) assistance. PEM is trained for each user-environment pair using a small number of demonstrations, operates entirely on CPU, and produces uncertainty estimates, which are used to selectively invoke LLM-based reasoning only for ambiguous decisions, balancing efficiency, privacy, and reasoning capability. To evaluate this formulation beyond existing benchmarks, we introduce APOR, a synthetic, LLM-generated dataset that captures room-level, multi-furniture environments, diverse organizational profiles, explicit abstention behavior, and noisy partial scene context. Extensive experiments on both PARSEC and APOR provide initial evidence that APOLLO improves over prior LLM-based baselines in controlled benchmark settings while substantially reducing LLM usage. Code is available at https://github.com/PaInt-Lab/APOLLO.
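The selective-invocation idea above can be sketched as an uncertainty gate: the lightweight model's output distribution is scored for ambiguity, and the LLM is consulted only when that score crosses a threshold. The use of normalized entropy as the uncertainty measure, the threshold value, the explicit "abstain" option in the distribution, and the `ask_llm` callback are all hypothetical stand-ins; the abstract does not say how APOLLO computes or thresholds uncertainty.

```python
import math
from typing import Callable, Dict, Tuple

def normalized_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs.values() if p > 0.0)
    return h / math.log(len(probs))

def decide(
    pem_probs: Dict[str, float],
    ask_llm: Callable[[], str],
    threshold: float = 0.6,
) -> Tuple[str, bool]:
    """Return (decision, used_llm).

    pem_probs maps each candidate location, plus an "abstain" option, to the
    lightweight model's probability. The entropy measure, the threshold, and
    the ask_llm callback are hypothetical stand-ins, not APOLLO's actual API.
    """
    if normalized_entropy(pem_probs) > threshold:
        return ask_llm(), True  # ambiguous: defer to the (costly) LLM
    best = max(pem_probs, key=pem_probs.get)
    return best, False  # confident: stay on the local model
```

Raising the threshold trades reasoning quality on hard cases for fewer LLM calls, which is the efficiency and privacy balance the abstract refers to.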

AI and Large Models

Vision and Robotics

Science and Healthcare

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

Damage Adaptation in Seconds for Architected Materials

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Agent Utilities over Generalized Voronoi Regions and their Gradients

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

Improving and Evaluating Hand-Object Interaction Detection

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

Performance-Driven Environment Abstraction with Multi-Timescale Learning

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

Do Large Language Models Always Tell The Same Stories?

Counterfactual Optimization of Baseball Pitch Sequences and Estimation of Its Impact on Season-Level Statistics

Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling

Learning a Maximum Entropy Model for Visual Textures using Diffusion

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Decision-Driven Geosteering Under Uncertainty: A Unified Framework for Sequential Decision Optimization

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

ProCUA-SFT Technical Report

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Abstention-Aware Personalized Object Rearrangement via Uncertainty-Guided LLM Assistance