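arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.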

2606.13435 2026-06-12 cs.RO New submission

GIVE: Grounding Human Gestures in Vision-Language-Action Models

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang

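Affiliations: MARS Lab, Nanyang Technological University

AI summary: To address VLA models ignoring gestures, which leads to inaccurate intent understanding, the GIVE method strengthens gesture understanding through dual visual and semantic pathways; in real-world HRI experiments it improves target object recognition accuracy by 40% and task success rate by 80%.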

Comments Project page: https://luis-cloud-sg.github.io/GIVE-project/

Abstract

Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.

2606.13432 2026-06-12 cs.CV cs.AI New submission

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

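Affiliations: Kuaishou Technology; Tsinghua University; University of Science and Technology of China

AI summary: Proposes the OmniDirector framework, which encodes camera parameters as grid motion videos and trains on million-scale camera grid-video pairs, achieving multi-shot camera motion cloning without cross-paired data and with strong controllability.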

Comments 12 pages, 8 figures

Abstract

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffers from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

2606.13427 2026-06-12 cs.CV New submission

VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

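Affiliations: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary: Proposes the VietFashion benchmark, which targets the Ao Dai, a traditional Vietnamese garment, and combines hand-drawn sketches with text descriptions for multi-target retrieval, revealing the shortcomings of existing methods in fine-grained cultural semantics and cross-modal composition.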

Comments ICMR 2026. Project page: https://hng0303.github.io/VietFashion

DOI: 10.1145/3805622.3810590

Abstract

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. The textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.

2606.13426 2026-06-12 cs.LG stat.ML New submission

Accelerating Speculative Diffusions via Block Verification

Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

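Affiliations: KTH; Google Research; Google DeepMind

AI summary: Proposes a speculative sampling scheme for diffusion models that raises draft acceptance rates through block verification; the training-free Free Drafter achieves up to a 6.3% speedup.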

Abstract

Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
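The acceptance-rejection step that the abstract describes as straightforward in discrete spaces can be sketched as follows. This is a minimal illustration of standard discrete speculative sampling, not the paper's continuous-diffusion scheme or its block verification; the function name `speculative_step` and the arguments `p_target` and `q_draft` are hypothetical.

```python
import random

def speculative_step(p_target, q_draft, rng):
    """One accept/reject step over a discrete vocabulary.

    p_target and q_draft are probability lists over the same tokens
    (hypothetical inputs). Returns (token, accepted_flag).
    """
    tokens = range(len(q_draft))
    # The draft model proposes a token.
    x = rng.choices(tokens, weights=q_draft, k=1)[0]
    # Accept the proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the residual distribution,
    # which is proportional to max(p - q, 0).
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    y = rng.choices(tokens, weights=residual, k=1)[0]
    return y, False
```

Repeating this step many times yields tokens distributed according to `p_target`, which is the property the acceptance-rejection scheme is meant to guarantee. The hard part in the continuous setting is the residual draw, which has no such closed form.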

2606.13411 2026-06-12 cs.CL New submission

An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

Dihia Lanasri, Fatima Benbarek

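Affiliations: ATM Mobilis; USTHB Algiers

AI summary: To address resource scarcity, code-switching, and related problems in rumour detection for the Algerian dialect, proposes an end-to-end hybrid framework that combines transformer embeddings with a classical classifier, reaching an F1-score of 0.84, and finds that domain-specific pre-training matters more than model size.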

Abstract

The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

2606.13410 2026-06-12 cs.CV cs.HC New submission

Person Identification from Contextual Motion

Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni

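Affiliations: Technion – Israel Institute of Technology; University of Haifa

AI summary: Proposes a generative model describing the action instance creation process and derives probabilistic identity inference schemes for surveillance and authentication applications; introduces an interactive person identification scenario that maximizes mutual information through a sequential message exchange, achieving high recognition rates.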

Abstract

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
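The cue-selection and posterior-update loop described in this abstract can be illustrated with a small discrete sketch. It assumes responses are quantized into a finite set and that a likelihood table `lik[cue][subject][response]` is available; both are simplifying assumptions for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def mutual_information(prior, lik_cue):
    """I(response; identity) for one cue, with lik_cue[s][r] = P(r | s, cue)."""
    n_subjects = len(prior)
    n_responses = len(lik_cue[0])
    # Marginal response distribution under the current belief over identities.
    marginal = [sum(prior[s] * lik_cue[s][r] for s in range(n_subjects))
                for r in range(n_responses)]
    # Expected entropy of the response once the identity is known.
    conditional = sum(prior[s] * entropy(lik_cue[s]) for s in range(n_subjects))
    return entropy(marginal) - conditional

def select_cue(prior, lik):
    """Pick the cue whose expected response is most informative about identity."""
    return max(range(len(lik)), key=lambda c: mutual_information(prior, lik[c]))

def update_posterior(prior, lik_cue, response):
    """Bayes update of the identity belief after observing one response."""
    unnorm = [prior[s] * lik_cue[s][response] for s in range(len(prior))]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

A session would alternate `select_cue` and `update_posterior` until the largest posterior entry exceeds a chosen confidence threshold, mirroring the termination rule in the abstract.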

2606.13407 2026-06-12 cs.AI New submission

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers

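Affiliations: Computing Science and Mathematics, University of Stirling

AI summary: Proposes a metaheuristic approach based on Iterated Local Search and Simulated Annealing that optimizes appliance start times to maximize solar energy utilization and handles task spillover across multiple days.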

Comments 9 pages; full results and methodology for poster paper accepted to GECCO 2026

DOI: 10.1145/3795101.3805310

Abstract

Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limit, battery state of charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
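A heavily simplified sketch of simulated annealing over appliance start times may help make the optimization concrete. It assumes an hourly solar forecast and appliances given as `(duration_hours, power_kw, preferred_start_hour)` tuples, and it ignores the battery, inverter limit, and multi-day spillover that the paper handles; the cost function, penalty weight, and cooling schedule are illustrative assumptions, not the authors' formulation.

```python
import math
import random

def cost(starts, appliances, solar, penalty=0.1):
    """Energy not covered by solar plus a penalty for shifting start times."""
    load = [0.0] * len(solar)
    for start, (duration, power, _preferred) in zip(starts, appliances):
        for hour in range(start, start + duration):
            load[hour] += power
    unmet = sum(max(l - g, 0.0) for l, g in zip(load, solar))
    inconvenience = sum(abs(s - a[2]) for s, a in zip(starts, appliances))
    return unmet + penalty * inconvenience

def anneal(appliances, solar, rng, steps=5000, t0=1.0, alpha=0.999):
    """Simulated annealing; preferred starts must fit inside the horizon."""
    horizon = len(solar)
    starts = [a[2] for a in appliances]  # begin at the preferred start times
    current = cost(starts, appliances, solar)
    best, best_cost = list(starts), current
    temp = t0
    for _ in range(steps):
        # Neighbour move: shift one appliance by one hour, clamped to the day.
        i = rng.randrange(len(appliances))
        candidate = list(starts)
        latest = horizon - appliances[i][0]
        candidate[i] = min(max(candidate[i] + rng.choice([-1, 1]), 0), latest)
        c = cost(candidate, appliances, solar)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if c <= current or rng.random() < math.exp((current - c) / temp):
            starts, current = candidate, c
            if current < best_cost:
                best, best_cost = list(starts), current
        temp *= alpha  # geometric cooling
    return best, best_cost
```

Because the best schedule seen so far is tracked separately, the returned cost is never worse than that of the user-preferred schedule the search starts from.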

2606.13405 2026-06-12 cs.AI cs.MA New submission

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

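Affiliations: German Research Center for Artificial Intelligence (DFKI); Saarland University

AI summary: Proposes making in-domain symbolic structures (regulations, process models, compliance constraints) core architectural components of agents, achieving compliance-by-construction as a complement to guardrail monitoring, and lists neuro-symbolic research challenges.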

Comments Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026

2606.13400 2026-06-12 cs.LG cs.AI cs.RO New submission

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

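Affiliations: University of Science and Technology of China

AI summary: Proposes PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics; a discrete-time flow formulation and a projection-free architecture eliminate discretization error and strictly satisfy arbitrary polyhedral constraints, achieving zero constraint violation and lower inference latency on planning and control tasks.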

Comments 30 pages, 12 figures, Accepted to ICML 2026

Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.
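To give a feel for how an update can respect polyhedral constraints without a projection step, here is a generic ratio-test sketch: starting from a feasible point, the step along a proposed direction is shortened just enough to keep every constraint of `A x <= b` satisfied. This is a textbook device shown only for intuition. It is not PolyFlow's architecture or flow formulation, and the function names are hypothetical.

```python
def max_feasible_step(A, b, x, v, t_max=1.0):
    """Largest t in [0, t_max] such that A (x + t v) <= b, for feasible x."""
    t = t_max
    for row, bi in zip(A, b):
        av = sum(a * vi for a, vi in zip(row, v))
        if av > 0:
            # Only constraints that the direction moves toward can bind.
            slack = bi - sum(a * xi for a, xi in zip(row, x))
            t = min(t, slack / av)
    return max(t, 0.0)

def constrained_update(A, b, x, v):
    """Move along v, stopping at the polytope boundary if needed."""
    t = max_feasible_step(A, b, x, v)
    return [xi + t * vi for xi, vi in zip(x, v)]
```

For the unit box in two dimensions, a step of `[1.0, 0.25]` from `[0.5, 0.5]` is cut in half so that the first coordinate stops exactly at the face `x1 = 1`.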

2606.13394 2026-06-12 cs.RO New submission

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li

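Affiliations: Beijing Institute of Technology

AI summary: Proposes the GeoHAT framework, which injects geometric information through a lightweight Fourier spatial encoder and uses a hybrid whole-body action decoder to decompose arm and base actions, surpassing the strongest baseline on the ManiSkill-HAB benchmark by 23.7% in success rate.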

Abstract

Whole-body mobile manipulation requires coordinating mobile base and manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies either rely on 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.

2606.13382 2026-06-12 cs.CV cs.AI New submission

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

Zian Yang, Zixin Wang

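Affiliations: Fudan University

AI summary: Proposes SmartFont, a diffusion framework that combines global content-style generation with weakly supervised local corrective experts and introduces a denoising-state condition allocation module that dynamically weights global and local features, balancing global completeness and local detail fidelity in few-shot font generation.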

Abstract

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves better global-local balance and improves glyph quality and local detail fidelity.

2606.13381 2026-06-12 cs.LG New submission

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Huyen Vo, María Martínez-García, Isabel Valera

AI summary: To address the trade-off between generative quality and semantic coherence in multimodal VAEs, proposes Hölder++, which improves coherence while preserving generative quality through exact Hölder pooling, an extended architecture, and hierarchical inference.

Comments Accepted at ICML 2026. Camera-ready version

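Abstract

Existing multimodal variational autoencoder (VAE) approaches face a trade-off between generative quality and coherence: they struggle to generate samples that are both realistic and diverse and semantically consistent across modalities. A recent work showed that using a simple approximation of Hölder pooling as the aggregation method improves coherence beyond the state-of-the-art MMVAE+, despite assuming that all modalities share a single representation. However, it slightly sacrifices sample diversity. Motivated by this, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first approximation-free implementation of Hölder pooling for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further strengthens the disentanglement between shared and private representations (Hölder++). Our experiments confirm that Hölder++ consistently improves the generative quality-coherence trade-off, yields a more structured latent space, and learns shared representations that are informative for downstream tasks.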

VideoMDM: From 2D Supervision to 3D Human Motion Generation

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

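Affiliations: Technion; NVIDIA

AI summary: Proposes the VideoMDM framework, which uses 2D poses from monocular video to learn a 3D motion prior with a diffusion model, approximating 3D supervision with a depth-weighted 2D reprojection loss and approaching fully 3D-supervised performance on HumanML3D.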

Comments https://videomdm.github.io/

Abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
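The intuition behind a depth-weighted reprojection loss can be checked with a small pinhole-camera sketch. The abstract does not give the exact weighting, so the squared-depth weight, the function names, and the assumption that the prediction error is purely lateral (same depth as the true joint) are all illustrative choices rather than the paper's definition.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) to image coordinates."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def depth_weighted_reprojection_loss(pred_joints, keypoints_2d, depths, focal=1.0):
    """Mean squared 2D error, with each joint scaled by (depth / focal)^2.

    pred_joints: predicted 3D joints; keypoints_2d: observed 2D keypoints;
    depths: per-joint depth estimates (for example from a 2D-to-3D lifter).
    """
    total = 0.0
    for joint, (u, v), z in zip(pred_joints, keypoints_2d, depths):
        pu, pv = project(joint, focal)
        # A pixel error of size e at depth z corresponds to a lateral
        # 3D error of roughly e * z / focal, hence the squared weight.
        weight = (z / focal) ** 2
        total += weight * ((pu - u) ** 2 + (pv - v) ** 2)
    return total / len(pred_joints)
```

Under the lateral-error assumption the weighted 2D loss equals the 3D mean squared error exactly, which is the kind of correspondence the abstract's equivalence claim builds on; errors along the depth axis are not captured by this simplified version.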

2606.13361 2026-06-12 cs.AI cs.CE cs.MA New submission

Can I Buy Your KV Cache?

Luoyuan Zhang

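Affiliations: Harbin Institute of Technology, Shenzhen (HITSZ)

AI summary: To address AI agents repeatedly computing the KV cache for the same document, proposes that publishers precompute the KV cache and other agents pay to load it and skip prefill; experiments on Qwen3-4B show a 9-50x reduction in compute cost, and an agent-native prefill CDN architecture is framed.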

Abstract

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
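The headline figures in this abstract can be restated per agent with a few lines of arithmetic. The sketch uses only the rounded totals quoted above (80M agents, about $1.5M to re-prefill, about $0.03M to reuse), so its ratio comes out at 50 rather than the reported 49.7, which presumably reflects unrounded measurements; the constant and function names are hypothetical.

```python
AGENTS = 80_000_000          # agents served one hot document (from the abstract)
REPREFILL_TOTAL_USD = 1.5e6  # rounded total cost of re-running prefill
REUSE_TOTAL_USD = 0.03e6     # rounded total compute cost of loading the cache

def per_agent(total_usd, agents=AGENTS):
    """Average cost per agent in dollars."""
    return total_usd / agents

def saving_ratio(reprefill_usd, reuse_usd):
    """How many times cheaper reuse is than re-prefilling."""
    return reprefill_usd / reuse_usd

def attention_cost_growth(length_before, length_after):
    """Relative growth of the L^2 attention term when the document lengthens."""
    return (length_after / length_before) ** 2
```

With these inputs, re-prefilling costs just under two cents per agent and reuse costs a small fraction of a cent. `attention_cost_growth(1000, 2000)` evaluates to 4.0, which illustrates why the gap between prefill and reuse widens with document length.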

2606.13355 2026-06-12 cs.RO cs.AI New submission

Real-Time Execution with Autoregressive Policies

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

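Affiliations: Korea Institute of Science and Technology; Seoul National University; Google Research

AI summary: Achieves real-time execution for autoregressive policies through asynchronous inference and constrained decoding, guaranteeing low latency while improving task completion speed; experiments show it outperforms flow-matching policies.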

Abstract

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.13352 2026-06-12 cs.RO New submission

Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications

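Christian Diaz Herrera, Srushti Raste, Simin Liu, Miles Modeste, Jiyang Yin, Katelyn McCall, Yuxing Jared Yao, Roopkamal Chahal, Simon Chidley, Trung Ha, T. David Westmoreland, Sonia Roberts

Affiliations: Wesleyan University

AI summary: Proposes a conductive fiber that can be manufactured quickly using only inexpensive commercial parts and tools and that supports both resistive strain sensing and capacitive touch sensing; experiments validate its use in robotic grasping, pose estimation, and near-field tracking.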

Abstract

Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor that triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.

2606.13349 2026-06-12 cs.CL New submission

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

Haishuo Fang, Yue Feng, Iryna Gurevych

Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE, Germany; School of Computer Science, University of Birmingham

AI summary: Proposes ProReviewer, an LLM-based proactive scientific peer review agent that models reviewing as a Markov Decision Process and uses a structured review log to guide proactive investigation; it achieves the highest average score across five quality dimensions, outperforming existing methods.

Abstract

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods that use much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16%, both in relative terms. It also attains the highest win rates against baselines in human evaluation.

2606.13348 2026-06-12 cs.CL cs.AI New submission

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

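Affiliations: Instituto de Computación, Facultad de Ingeniería, Universidad de la República

AI summary: Proposes IVIE, a neuro-symbolic approach that combines LLM creativity with the coherence of symbolic validation, building playable interactive fiction worlds through a four-stage incremental generation pipeline; human evaluation shows that it produces immersive, thematically coherent worlds, balancing flexibility with narrative consistency.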

Comments 10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC'26), June 2026

Abstract

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions (setting and character creation, puzzle design) to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results suggest that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

2606.13347 2026-06-12 cs.LG New submission

Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

Jagriti Singh, Shekhar Verma, Muneendra Ojha

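Affiliations: University of Allahabad

AI summary: Proposes a training-free, sampling-time, density-aware method that modifies the classifier gradient to steer trajectories toward low-confidence regions and guides sampling toward the predicted real image, enhancing diffusion models' exploration of low-density regions.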

Abstract

Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. We apply guidance to the noisy images, not to the predicted noise as most diffusion models do. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via the modified classifier gradient, and at each time step we also guide the sampling process toward the predicted real image. The first guidance term helps explore low-probability samples, and the second helps keep generated samples close to the real data manifold. The proposed sampler consistently improves ADM recall at 64x64 resolution while maintaining a comparable FID, and with a 256x256 ADM model we show visual results for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted-real-image guidance, helps generate samples of high perceptual quality with a 256x256 ADM model on ImageNet.
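
For orientation, the standard ingredients this abstract builds on can be written out. This is a sketch of the usual classifier-guided reverse step and the usual clean-image estimate, not the modified sampler proposed in the paper, whose exact guidance terms the abstract does not give. The symbols (guidance scale s, classifier p_phi, noise predictor epsilon_theta, cumulative noise schedule alpha-bar_t) follow common convention and are assumptions here.

```latex
% Standard classifier guidance: shift the reverse-step mean along the classifier gradient
x_{t-1} \sim \mathcal{N}\!\left(\mu_\theta(x_t, t) + s\,\Sigma_\theta(x_t, t)\,\nabla_{x_t} \log p_\phi(y \mid x_t),\; \Sigma_\theta(x_t, t)\right)

% Predicted real (clean) image recovered from the noisy image and the predicted noise
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```

The first expression pulls samples toward regions where the classifier is confident in class y, which is the high-density concentration the abstract describes. The second is the quantity on which a predicted-real-image guidance term would plausibly be built.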

2606.13345 2026-06-12 cs.CV New submission

JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan

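Affiliations: East China Normal University; Shanghai Artificial Intelligence Laboratory; Fudan University; Tencent

AI summary: Proposes JointEdit3D, which performs feed-forward 3D scene editing via asymmetric latent inpainting in a unified RGB-geometry reconstruction-generation latent space, introduces a SceneAnchor Branch and edit/background-aware losses, and builds the SceneEdit3D-15K dataset and the SceneEdit3D-Bench benchmark, significantly improving edited-region quality and 3D structural completeness.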

Comments Preprint. Project page: https://xinnan-zhu.github.io/JointEdit3D-Page/

Abstract

Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

2606.13340 2026-06-12 cs.RO New submission

EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection

Dario Onfiani, Michael Dyck, Luigi Biagiotti, Julian Klodmann

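Affiliations: University of Modena and Reggio Emilia; German Aerospace Center (DLR)

AI summary: Proposes a framework that adapts anisotropic virtual fixtures based on EMG signals, dynamically adjusting the constraint by inferring the surgeon's intent in real time; experiments show improved surgical accuracy and movement consistency and reduced cognitive load.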

Abstract

In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as resection and dissection. Although virtual fixtures offer significant advantages for guiding a surgeon's movements, conventional virtual fixtures are often defined by fixed geometries and lack the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.

2606.13338 2026-06-12 cs.LG New submission

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

Kaijie Xu, Anqi Wang, Xilin Dai

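Affiliations: ZJU-UIUC Institute, Zhejiang University

AI summary: To address the inability of existing benchmarks to evaluate the safety-fidelity trade-off in massive-variate probabilistic forecasting, proposes PowerPhase, a power-system benchmark with up to 36,964 channels, and PowerForge, a scenario-based quantile forecaster that achieves the best average rank across multiple grids.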

Abstract

Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks lack either temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
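
To make two of the named metrics concrete, here is a minimal sketch of how CRPS and a CVaR-style tail average are commonly estimated from a set of forecast scenarios. It uses the generic textbook estimators; the function names are hypothetical, and the benchmark's own definitions (in particular what quantity CVaR-alpha is applied to) are not given in the abstract and may differ.

```python
import math
from statistics import fmean


def crps_ensemble(samples, y):
    """Empirical CRPS of a scenario ensemble against one scalar observation.

    Uses the standard estimator E|X - y| - 0.5 * E|X - X'|.
    """
    n = len(samples)
    term_obs = fmean([abs(s - y) for s in samples])
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) share of losses (at least one sample)."""
    ordered = sorted(losses, reverse=True)
    # round() guards against floating-point noise such as 3.0000000000000004
    k = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    return fmean(ordered[:k])
```

A lower CRPS rewards scenario sets that are both close to the observation and not overly spread out, while the tail average only looks at the worst outcomes. That difference is one way the two kinds of metric can rank models differently.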

2606.13332 2026-06-12 cs.CV New submission

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab

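Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML); Carl Zeiss AG

AI summary: To address the lack of temporal modeling in scene-graph methods for operating room activity understanding, proposes a fine-grained multi-role action benchmark built on a public dataset and introduces a vision-only temporal model that significantly outperforms graph-based methods, together with a multi- to single-view feature alignment strategy that improves single-view performance.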

Abstract

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

2606.13322 2026-06-12 cs.CL New submission

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

Affiliations: The University of Tokyo; National Institute of Advanced Industrial Science and Technology; Technical University of Munich; Keio University; Carnegie Mellon University; Nara Women’s University

AI summary: Proposes a low-latency real-time game commentary system that runs text generation in parallel with speech playback, reducing the mean inter-utterance silence from 9.6 seconds to 0.3 seconds and markedly improving commentary rhythm.

Comments Accepted at IJCAI-ECAI 2026 (Demonstrations Track)

Abstract

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
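
The parallel design described above resembles a producer-consumer pattern, which can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: `generate_text` and `speak` are hypothetical callables standing in for the LLM and for speech synthesis plus playback, and the buffer here is played in simple first-in, first-out order, whereas the real system may choose among candidates or drop stale ones.

```python
import queue
import threading


def generate_utterances(frames, buffer, generate_text):
    """Producer: keep generating commentary text while playback is busy."""
    for frame in frames:
        buffer.put(generate_text(frame))  # blocks only when the buffer is full
    buffer.put(None)  # sentinel: no more utterances will arrive


def play_utterances(buffer, speak, played):
    """Consumer: speak the next buffered utterance as soon as the last one ends."""
    while True:
        text = buffer.get()  # returns immediately if text is already buffered
        if text is None:
            break
        speak(text)
        played.append(text)


def run_pipeline(frames, generate_text, speak, buffer_size=3):
    """Run generation and playback concurrently; return the utterances played."""
    buffer = queue.Queue(maxsize=buffer_size)
    played = []
    producer = threading.Thread(
        target=generate_utterances, args=(frames, buffer, generate_text)
    )
    consumer = threading.Thread(target=play_utterances, args=(buffer, speak, played))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return played
```

Because the producer does not wait for playback to finish, the consumer usually finds text already waiting at each playback boundary. That removes the generation time from the gap between utterances, which is the sequential bottleneck the abstract identifies.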

2606.13317 2026-06-12 cs.CL New submission

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

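Affiliations: School of Computer Science, Wuhan University; School of Computer Science, Fudan University

AI summary: Proposes SkillCAT, a training-free framework for skill self-evolution in LLM agents with three stages (Contrastive Causal Extraction, Assessment-Augmented Evolution, and Topology-Aware Task Execution), achieving average gains of up to 40.40% on multiple benchmarks.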

Comments 9 pages, 6 figures

Abstract

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.
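
Two of the stages described above can be illustrated with a small sketch: pairing same-task successes with failures (the idea behind CCE) and keeping only patches that do not hurt the replayed outcome (the idea behind AAE). The data layout, function names, and scoring callable are all hypothetical; the abstract does not specify how SkillCAT represents trajectories or scores replays.

```python
from itertools import product


def contrastive_pairs(trajectories):
    """Pair every successful trajectory with every failed one for the same task.

    Each trajectory is a dict with the assumed keys "task", "success" (bool),
    and "steps".
    """
    by_task = {}
    for traj in trajectories:
        groups = by_task.setdefault(traj["task"], {True: [], False: []})
        groups[traj["success"]].append(traj)
    pairs = []
    for groups in by_task.values():
        # Tasks with only successes or only failures contribute no pairs.
        pairs.extend(product(groups[True], groups[False]))
    return pairs


def keep_validated_patches(patches, replay, baseline_score):
    """Keep patches whose replayed score improves on or preserves the baseline."""
    return [patch for patch in patches if replay(patch) >= baseline_score]
```

Sampling several trajectories per task is what makes the pairing possible: with a single trajectory per task there is nothing to contrast, which is the limitation of current pipelines that the abstract points out.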
