arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13435 2026-06-12 cs.RO New submission

GIVE: Grounding Human Gestures in Vision-Language-Action Models

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang

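Affiliations: MARS Lab, Nanyang Technological University

AI summary: VLA models ignore gestures, which makes intent understanding inaccurate. GIVE strengthens gesture understanding through a visual pathway and a semantic pathway; in real-world HRI experiments it improves target object recognition accuracy by 40% and task success rate by 80%.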

Comments Project page: https://luis-cloud-sg.github.io/GIVE-project/

English abstract

Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.
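One geometric reading of the fingertip rays mentioned above is that the pointed-at object is the one closest to the ray. The sketch below illustrates only that idea. The abstract does not say how GIVE computes or uses its rays beyond overlaying them on the observation, so the 3D inputs and the names `point_to_ray_distance` and `pick_pointed_object` are assumptions made for illustration.

```python
import math

def point_to_ray_distance(origin, direction, point):
    """Distance from `point` to the ray starting at `origin` along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    rel = [p - o for p, o in zip(point, origin)]
    # Projection length along the ray, clamped at zero so that points behind
    # the fingertip are measured from the ray origin.
    t = max(0.0, sum(r * u for r, u in zip(rel, unit)))
    closest = [o + t * u for o, u in zip(origin, unit)]
    return math.dist(point, closest)

def pick_pointed_object(origin, direction, objects):
    """Return the name of the object whose center lies nearest the ray."""
    return min(objects, key=lambda name: point_to_ray_distance(origin, direction, objects[name]))

# Hypothetical scene: fingertip at the origin, pointing along +x.
objects = {"cup": (1.0, 0.1, 0.0), "bowl": (1.0, 1.0, 0.0)}
target = pick_pointed_object((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), objects)
```

Here `target` is the cup, because its center sits 0.1 from the ray while the bowl sits 1.0 away.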

2606.13432 2026-06-12 cs.CV cs.AI New submission

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

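Affiliations: Kuaishou Technology; Tsinghua University; University of Science and Technology of China

AI summary: Proposes OmniDirector, a framework that encodes camera parameters as grid motion videos and trains on million-scale camera grid-video pairs, enabling multi-shot camera motion cloning without cross-paired data and with strong controllability.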

Comments 12 pages, 8 figures

English abstract

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffers from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

2606.13427 2026-06-12 cs.CV New submission

VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

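Affiliations: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary: Proposes the VietFashion benchmark, which targets the Ao Dai, a traditional Vietnamese garment, and combines hand-drawn sketches with text descriptions for multi-target retrieval; it exposes the weaknesses of existing methods in fine-grained cultural semantics and cross-modal composition.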

Comments ICMR 2026. Project page: https://hng0303.github.io/VietFashion

English abstract

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.

2606.13426 2026-06-12 cs.LG stat.ML New submission

Accelerating Speculative Diffusions via Block Verification

Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

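Affiliations: KTH (Royal Institute of Technology); Google Research; Google DeepMind

AI summary: Proposes a speculative sampling scheme for diffusion models that raises the draft acceptance rate through block verification; the training-free Free Drafter achieves up to a 6.3% speedup.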

English abstract

Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
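The discrete case that the abstract calls straightforward can be sketched in a few lines: accept a draft token with probability min(1, p/q), and on rejection resample from the residual distribution proportional to max(p - q, 0). This is the standard single-token speculative sampling step for discrete distributions, not the paper's continuous or block-verification scheme; the toy distributions `p` and `q` are made up for illustration.

```python
import random

def speculative_step(p, q, rng):
    """One accept/reject step with draft distribution `q` and target `p`.

    The returned token is distributed exactly according to `p`.
    """
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]        # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with prob min(1, p/q)
        return x
    # Rejected: sample from the residual, proportional to max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]

# Hypothetical target and draft distributions over three tokens.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_step(p, q, rng)] += 1
freqs = [c / n for c in counts]
```

The empirical frequencies in `freqs` should track `p` rather than `q`. The difficulty the abstract points to is that, in a continuous space, the residual is a density that cannot be normalized and sampled by enumerating tokens as done here.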

2606.13411 2026-06-12 cs.CL New submission

An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

Dihia Lanasri, Fatima Benbarek

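Affiliations: ATM Mobilis; USTHB, Algiers

AI summary: Addresses resource scarcity and code-switching in Algerian-dialect rumour detection with an end-to-end hybrid framework that combines transformer embeddings with a classical classifier, reaching an F1-score of 0.84; domain-specific pre-training is found to matter more than model size.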

English abstract

The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

2606.13410 2026-06-12 cs.CV cs.HC New submission

Person Identification from Contextual Motion

Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni

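Affiliations: Technion – Israel Institute of Technology; University of Haifa

AI summary: Proposes a generative model of the action instance creation process and derives probabilistic identity inference schemes for surveillance and authentication applications; introduces an interactive person identification scenario in which a sequential message exchange selects cues that maximize mutual information, achieving high recognition rates.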

English abstract

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
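The cue selection and posterior update described above can be illustrated with a small discrete model. This is a sketch under strong assumptions: responses are reduced to a handful of discrete types, and the likelihood table `likelihood[cue][identity][response]` is invented for the example. The paper's HIP-inspired generative model over motion responses is not reproduced here.

```python
import math

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def posterior(prior, likelihood_c, r):
    """Bayes update; likelihood_c[i][r] = P(response r | identity i, this cue)."""
    joint = [prior[i] * likelihood_c[i][r] for i in range(len(prior))]
    z = sum(joint)
    return [j / z for j in joint]

def mutual_information(prior, likelihood_c):
    """I(identity; response) for one cue = H(prior) - E_r[H(posterior)]."""
    expected_h = 0.0
    for r in range(len(likelihood_c[0])):
        p_r = sum(prior[i] * likelihood_c[i][r] for i in range(len(prior)))
        if p_r > 0:
            expected_h += p_r * entropy(posterior(prior, likelihood_c, r))
    return entropy(prior) - expected_h

def select_cue(prior, likelihood):
    """Index of the cue with the largest expected information about identity."""
    return max(range(len(likelihood)), key=lambda c: mutual_information(prior, likelihood[c]))

# Toy model: 2 subjects, 2 cues, 2 response types (hypothetical numbers).
likelihood = [
    [[0.5, 0.5], [0.5, 0.5]],  # cue 0: both subjects respond alike, so it is uninformative
    [[0.9, 0.1], [0.2, 0.8]],  # cue 1: responses differ between subjects
]
prior = [0.5, 0.5]
best = select_cue(prior, likelihood)
updated = posterior(prior, likelihood[best], 0)
```

In a full session the system would repeat the select, observe, and update steps, stopping once the largest entry of the posterior exceeds a chosen confidence threshold.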

2606.13407 2026-06-12 cs.AI New submission

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers

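Affiliations: Computing Science and Mathematics, University of Stirling

AI summary: Proposes a metaheuristic approach based on Iterated Local Search and Simulated Annealing that optimizes appliance start times to maximize solar energy utilization and handles task spillover across multiple days.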

Comments 9 pages; full results and methodology for poster paper accepted to GECCO 2026

English abstract

Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limits, battery state-of-charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
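A stripped-down version of the start-time search can be sketched with Simulated Annealing alone. The example assumes a single day, hourly resolution, and an objective that counts only the energy solar cannot cover; the battery, inverter limit, user-inconvenience term, and multi-day spillover from the abstract are left out. The forecast in `SOLAR` and the appliance table are invented numbers.

```python
import math
import random

# Hypothetical hourly solar forecast in kW (24 values).
SOLAR = [0] * 6 + [1, 2, 3, 4, 4, 4, 4, 3, 2, 1] + [0] * 8
# Hypothetical appliances: name -> (duration in hours, power in kW).
APPLIANCES = {"washer": (2, 2.0), "dryer": (2, 3.0), "cooker": (1, 2.0)}

def unmet_energy(starts):
    """kWh of demand that the solar forecast cannot cover for these start hours."""
    load = [0.0] * 24
    for name, start in starts.items():
        hours, power = APPLIANCES[name]
        for h in range(start, start + hours):
            load[h] += power
    return sum(max(l - s, 0.0) for l, s in zip(load, SOLAR))

def random_start(name, rng):
    """A start hour that lets the appliance finish within the day."""
    return rng.randrange(0, 24 - APPLIANCES[name][0] + 1)

def anneal(rng, steps=2000, temp=2.0, cooling=0.995):
    current = {name: random_start(name, rng) for name in APPLIANCES}
    cost = unmet_energy(current)
    best, best_cost = dict(current), cost
    for _ in range(steps):
        # Neighbor move: reschedule one randomly chosen appliance.
        name = rng.choice(list(APPLIANCES))
        candidate = dict(current)
        candidate[name] = random_start(name, rng)
        new_cost = unmet_energy(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            current, cost = candidate, new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        temp *= cooling
    return best, best_cost

best, best_cost = anneal(random.Random(0))
```

Accepting occasional worse schedules early on, while the temperature is high, is what lets the search leave poor local optima; as the temperature decays the procedure behaves like plain local search.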

2606.13405 2026-06-12 cs.AI cs.MA New submission

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

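Affiliations: German Research Center for Artificial Intelligence (DFKI); Saarland University

AI summary: Proposes treating the symbolic structures already present in regulated domains (regulations, process models, compliance constraints) as core architectural components of agents, achieving compliance-by-construction as a complement to guardrail-based monitoring, and lists neuro-symbolic research challenges.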

Comments Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026

English abstract

LLM-based agents are entering regulated industries where they automate judgment-intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges at the foundational and capability levels and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high-impact research domain.

2606.13400 2026-06-12 cs.LG cs.AI cs.RO New submission

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

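Affiliations: University of Science and Technology of China

AI summary: Proposes PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics; a discrete-time flow formulation and a projection-free architecture eliminate discretization error and strictly satisfy arbitrary polyhedral constraints, achieving zero constraint violation and lower inference latency on planning and control tasks.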

Comments 30 pages, 12 figures, Accepted to ICML 2026

English abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available at https://github.com/MJianM/PolyFlow.
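To make "strict satisfaction of polyhedral constraints without projection" concrete, the sketch below shows one generic way a discrete update can stay inside a polytope {x : Ax <= b}: cap the step length with a ratio test instead of projecting afterwards. This is a textbook device chosen for illustration and is not claimed to be PolyFlow's mechanism, which the abstract does not detail; the function names are hypothetical.

```python
def max_feasible_step(A, b, x, d):
    """Largest t in [0, 1] with A(x + t d) <= b, assuming A x <= b already holds."""
    t = 1.0
    for row, bound in zip(A, b):
        slack = bound - sum(a * xi for a, xi in zip(row, x))
        rate = sum(a * di for a, di in zip(row, d))
        if rate > 0:  # the step moves toward this face
            t = min(t, slack / rate)
    return max(t, 0.0)

def constrained_update(A, b, x, d):
    """Take the step x + t d with t capped so the result stays feasible."""
    t = max_feasible_step(A, b, x, d)
    return [xi + t * di for xi, di in zip(x, d)]

# Unit box 0 <= x, y <= 1 written as A x <= b.
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [1, 0, 1, 0]
x_new = constrained_update(A, b, [0.5, 0.5], [1.0, 0.25])
```

Here the full step would leave the box through the face x = 1, so it is shortened to half its length and lands on that face.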

2606.13394 2026-06-12 cs.RO New submission

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li

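Affiliations: Beijing Institute of Technology

AI summary: Proposes GeoHAT, which injects geometric information through a lightweight Fourier spatial encoder and decomposes arm and base actions with a Hybrid Whole-Body Action Decoder, surpassing the strongest baseline on the ManiSkill-HAB benchmark by 23.7% in success rate.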

English abstract

Whole-body mobile manipulation requires coordinating a mobile base and a manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies rely either on 2D features or on sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.
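The two ideas in the perception half of the abstract, Fourier features of 3D coordinates and a gate that shuts off geometry where depth is invalid, can be sketched without any learned weights. The frequency schedule, the scalar gate, and the function names below are assumptions for illustration; in a real model the geometric features would pass through learned projections and the gate logit would be predicted per token.

```python
import math

def fourier_encode(xyz, num_bands=4):
    """Map a 3D point to sin/cos features at frequencies 2^k * pi."""
    feats = []
    for coord in xyz:
        for k in range(num_bands):
            freq = (2 ** k) * math.pi
            feats.extend([math.sin(freq * coord), math.cos(freq * coord)])
    return feats

def gated_fuse(visual, geometric, gate_logit, depth_valid):
    """Add geometric features scaled by a sigmoid gate; invalid depth forces the gate to zero."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit)) if depth_valid else 0.0
    return [v + gate * g for v, g in zip(visual, geometric)]

geo = fourier_encode((0.1, -0.3, 1.2))
visual = [0.0] * len(geo)  # stand-in for a pretrained visual token
fused_valid = gated_fuse(visual, geo, gate_logit=0.0, depth_valid=True)
fused_invalid = gated_fuse(visual, geo, gate_logit=0.0, depth_valid=False)
```

With invalid depth the visual token passes through unchanged, which is the sense in which the pretrained representation is protected from noisy geometry.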

2606.13382 2026-06-12 cs.CV cs.AI New submission

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

Zian Yang, Zixin Wang

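Affiliations: Fudan University

AI summary: Proposes SmartFont, a diffusion framework that combines global content-style generation with weakly supervised local corrective experts and adds a denoising-state condition allocation module that dynamically weights global and local features, balancing global completeness and local detail fidelity in few-shot font generation.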

English abstract

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

2606.13381 2026-06-12 cs.LG New submission

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Huyen Vo, María Martínez-García, Isabel Valera

AI summary: Addresses the trade-off between generative quality and semantic coherence in multimodal VAEs with Hölder++, which uses exact Hölder pooling, an extended architecture, and hierarchical inference to improve coherence while preserving generative quality.

Comments Accepted at ICML 2026. Camera-ready version

English abstract

Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence, i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

2606.13379 2026-06-12 cs.LG cs.AR cs.ET New submission

Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

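Affiliations: Machine Learning and Human Language Technology Group, Faculty of Computer Science, RWTH Aachen University; Apptek GmbH

AI summary: Positional encodings degrade analog-to-digital conversion accuracy in memristor-based analog computation; adjusting the proportion of ADC weight and precision bits, or removing encoding-related linear transformations, reduces the degradation by about 50% and 30% relative, respectively.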

Comments Accepted at Interspeech 2026

English abstract

Memristors offer a new opportunity for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix multiplication. Yet computations on these devices are currently subject to large distortion, both in weight programming and in execution. In this work, we identify large output values of transformed positional encodings as a cause of major degradation within analog-to-digital conversion (ADC) in memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by ~50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case, the degradation can be reduced by ~30% relative after removing encoding-related linear transformations.

2606.13376 2026-06-12 cs.CV New submission

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

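Affiliations: South China University of Technology; Columbia University; Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group; Singapore Management University

AI summary: Proposes MoVerse, which builds an interactively navigable 360° panoramic world in real time from a single narrow-field-of-view image: topology-aware diffusion completes the field of view, panoramic geometry-aware residual prediction produces a 3D Gaussian scaffold, and a bidirectional diffusion teacher is distilled into a causal autoregressive student for low-latency video rendering.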

English abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

2606.13370 2026-06-12 cs.AI New submission

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

Joe Dwyer

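Affiliations: Department of Computer Information Science, ECPI University

AI summary: Using a repeated measures design, the study analyzes how validation loss, perplexity, and other metrics change with token count when a small Llama model is trained under a fixed compute budget; rapid early improvement is followed by non-monotonic degradation, indicating that compute-aware evaluation should examine training trajectories rather than endpoint metrics.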

English abstract

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
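The arithmetic behind the reported figures is easy to check. The sketch below recomputes the observation count and converts the three reported mean losses to perplexities, assuming the loss is a mean cross-entropy in nats so that perplexity equals exp(loss). The abstract does not state the logarithm base, so the perplexity values are indicative only.

```python
import math

# Mean validation losses reported in the abstract.
losses = {
    "initialization": 8.3552,
    "near 4M tokens": 2.7996,
    "final checkpoint": 3.9010,
}

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexities = {stage: math.exp(loss) for stage, loss in losses.items()}

# Six independent runs, each measured at 21 intervals.
runs, intervals = 6, 21
observations = runs * intervals  # seed-by-interval observations

# Relative increase in loss from the best reported point to the final checkpoint.
loss_regression = (losses["final checkpoint"] - losses["near 4M tokens"]) / losses["near 4M tokens"]
```

Under that assumption, the rise in loss from 2.7996 to 3.9010 corresponds to perplexity roughly tripling, from about 16 to about 49, which is the kind of late-training regression an endpoint-only comparison would miss.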

2606.13368 2026-06-12 cs.AI cs.CV 新提交

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

IterCAD:一种用于视觉引导的CAD生成与编辑的迭代多模态智能体

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出IterCAD,一种闭环交互式CAD生成与编辑的多模态智能体框架,通过渐进式SFT和几何感知强化学习优化,在代码可执行性和几何精度上显著超越现有方法。

详情
AI中文摘要

计算机辅助设计在现代制造业中至关重要,然而现有的自动化方法主要依赖于开环、一次性生成,与迭代的实际实践不匹配。在本文中,我们提出了IterCAD,一个统一的闭环交互式CAD生成与编辑的多模态智能体框架。我们将任务形式化为多模态智能体与可执行CAD沙箱之间的多轮交互,涵盖三个任务:绘图到代码、文本到代码和交互式编辑。为此,我们开发了一个数据合成流水线,结合先进的工业制造特征,生成符合标准的多视图工程图纸、复杂的代码编辑任务和高保真交互轨迹。我们通过渐进式SFT,然后结合几何感知强化学习和可行前缀掩码来优化智能体,以增强代码可执行性和几何保真度。最后,我们引入了IterCAD-Bench评估套件,并提出了Chamfer距离容忍度-召回率(CD-TR)曲线及其AUC-TR指标,建立了一个无幸存者偏差的标准,统一了代码有效性和几何精度。大量实验表明,IterCAD在多个基准测试中取得了极具竞争力的性能,在代码可执行性和几何精度上显著优于现有方法,并在闭环迭代优化中展现出卓越的能力。

英文摘要

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
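A tolerance-recall curve of the kind named above might be computed as in the sketch below. The abstract does not define CD-TR or AUC-TR precisely, so the details here are assumptions: non-executable programs are scored with an infinite Chamfer distance so they count as misses at every tolerance, and the AUC is a trapezoid area normalized by the tolerance range. All function names are hypothetical.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def tolerance_recall_curve(distances, tolerances) -> np.ndarray:
    """Fraction of ALL samples with distance <= tolerance, for each tolerance.

    Failed executions should be passed in as np.inf so they are never recalled
    (an assumed way to avoid scoring only the surviving samples).
    """
    d = np.asarray(distances, dtype=float)
    return np.array([(d <= t).mean() for t in tolerances])


def auc_tr(recalls, tolerances) -> float:
    """Trapezoid area under the curve, normalized to [0, 1] by the tolerance range."""
    t = np.asarray(tolerances, dtype=float)
    r = np.asarray(recalls, dtype=float)
    area = np.sum((r[1:] + r[:-1]) / 2.0 * np.diff(t))
    return float(area / (t[-1] - t[0]))


# Three samples: exact match, small error, and a program that failed to execute.
dists = [0.0, 0.5, np.inf]
tols = [0.0, 0.5, 1.0]
recalls = tolerance_recall_curve(dists, tols)
score = auc_tr(recalls, tols)
```

Because the failed sample stays in the denominator, a method cannot raise its score by producing fewer but cleaner executable outputs.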

2606.13366 2026-06-12 cs.CV cs.MM 新提交

Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization

双约束扩散图像压缩用于操作率失真感知优化

Sanxin Jiang, Jiro Katto, Heming Sun

发表机构 * Shanghai University of Electric Power(上海电力大学) Waseda University(早稻田大学) Institute of Science Tokyo(东京科学大学)

AI总结 提出DCIC框架,结合学习编解码器和基于扩散的解码器,通过联合失真和等幂约束实现率失真感知帕累托前沿的连续导航,无需额外码率开销。

详情
AI中文摘要

率失真感知(RDP)权衡通过施加重建的分布约束扩展了经典率失真理论,为联合控制保真度和感知真实性的神经图像压缩提供了统一框架。虽然先前的工作实现了接近最优的率感知权衡,但明确实现完整RDP曲面的实用框架仍然很少,主要由于在解码器引入公共随机性的困难。我们提出DCIC(双约束扩散图像压缩),它将学习编解码器与基于扩散的解码器相结合,受联合失真和等幂约束的支配。失真约束限制了相对于基础编解码器输出的重建保真度;等幂约束——要求重新编码恢复图像恢复基础编解码器重建——作为分布感知要求的可处理替代。它们通过一致噪声注入的迭代优化引导反向去噪过程,实现公共随机性而无需额外码率开销。在固定码率下,双衰减因子$(K_D, K_P)$共同导航失真感知平面的帕累托前沿,从单个比特流实现连续可调的保真度-真实感权衡。DCIC$_{RD}$($K_P{=}0$)和DCIC$_{RP}$($K_D{=}0$)作为边界曲线出现,DCIC$_{RDP}$($K_D = K_P=1$)实现最优内部工作点。在CelebA-HQ、CLIC2020和ImageNet-1K上,跨CNN、Transformer和混合架构的实验证实,DCIC$_{RDP}$在所有感知编解码器中实现了优越的BD-PSNR,而DCIC$_{RP}$在BD-FID上与专用感知方法相匹配,验证了完整RDP曲面导航的实用价值。

英文摘要

The rate-distortion-perception (RDP) trade-off extends classical rate-distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate-perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint, which requires that re-encoding the restored image recovers the base codec reconstruction, serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors $(K_D, K_P)$ jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC$_{RD}$ ($K_P{=}0$) and DCIC$_{RP}$ ($K_D{=}0$) arise as boundary curves, with DCIC$_{RDP}$ ($K_D = K_P=1$) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC$_{RDP}$ achieves superior BD-PSNR over all perceptual codecs, while DCIC$_{RP}$ matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.

2606.13364 2026-06-12 cs.LG cs.CV 新提交

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

VideoMDM: 从2D监督走向3D人体运动生成

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

发表机构 * Technion(以色列理工学院) NVIDIA(英伟达)

AI总结 提出VideoMDM框架,利用单目视频的2D姿态通过扩散模型学习3D运动先验,使用深度加权的2D重投影损失近似3D监督,在HumanML3D上接近全3D监督性能。

Comments https://videomdm.github.io/

详情
AI中文摘要

我们提出VideoMDM,一个基于扩散的框架,直接从单目视频中提取的精确2D姿态训练3D人体运动先验,无需任何3D真实数据。预训练的2D到3D提升器提供近似的3D姿态序列,作为有噪声的教师:这些序列被扩散,模型在3D空间去噪,并通过重投影预测并与精确关键点比较在2D空间进行监督。我们证明,在温和假设下,深度加权的2D重投影损失在期望上等价于直接3D监督,并将标准3D运动正则化器——速度一致性和过参数化表示对齐——适应到这一2D设置。与仅在推理时将2D提升到3D的方法不同,VideoMDM在训练期间学习一个连贯的3D运动流形。在HumanML3D上,它几乎缩小了与完全3D监督的MDM的差距(FID 0.88 vs 0.54);在真实视频数据集Fit3D和NBA上,该方法学习生成一致被人类偏好的运动,并取得了强定量结果。

英文摘要

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.
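The idea behind a depth-weighted reprojection loss can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulation: it assumes a pinhole camera with a known focal length and scales each joint's pixel residual by predicted depth divided by focal length, which converts pixels back to metric units at that depth. The names `project` and `depth_weighted_reprojection_loss` are hypothetical.

```python
import numpy as np


def project(points3d: np.ndarray, focal: float) -> np.ndarray:
    """Pinhole projection of (J, 3) camera-frame joints to (J, 2) image coordinates."""
    return focal * points3d[:, :2] / points3d[:, 2:3]


def depth_weighted_reprojection_loss(pred3d: np.ndarray,
                                     keypoints2d: np.ndarray,
                                     focal: float) -> float:
    """Mean squared 2D residual, with each joint weighted by (depth / focal).

    The weighting is an assumed form; the abstract only states that a
    depth-weighted 2D loss matches 3D supervision in expectation.
    """
    residual = project(pred3d, focal) - keypoints2d      # pixel error per joint
    weight = pred3d[:, 2] / focal                        # metres per pixel at the joint's depth
    return float(np.mean(np.sum((weight[:, None] * residual) ** 2, axis=1)))


# Two joints at different depths, both predicted 0.1 m too far to the right.
gt3d = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 4.0]])
kp2d = project(gt3d, focal=100.0)
pred3d = gt3d + np.array([0.1, 0.0, 0.0])
loss = depth_weighted_reprojection_loss(pred3d, kp2d, focal=100.0)
```

For a purely lateral offset the weighted pixel error reduces to the metric offset itself, so both joints contribute equally despite sitting at different depths. An unweighted pixel loss would penalize the nearer joint more.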

2606.13361 2026-06-12 cs.AI cs.CE cs.MA 新提交

Can I Buy Your KV Cache?

我能买你的KV缓存吗?

Luoyuan Zhang

发表机构 * Harbin Institute of Technology, Shenzhen (HITSZ)(哈尔滨工业大学(深圳))

AI总结 针对AI代理重复计算相同文档KV缓存的问题,提出由发布者预计算KV缓存,其他代理付费加载以跳过预填充,实验表明在Qwen3-4B上计算成本降低9-50倍,并设计了代理原生预填充CDN架构。

详情
AI中文摘要

现在,在世界各地,AI代理正在重复同样的荒谬行为:为了读取一份文档,每个代理都从头开始重新计算。每个代理都重新运行预填充——大型模型最计算密集的步骤——在相同的文本上,只是为了重建一个与之前代理刚刚构建的完全相同的键值(KV)缓存。相同的答案,被计算了一百万次。我们提出了一个几乎粗鲁简单的建议:只计算一次。让发布者预计算文档的KV缓存,然后让每个其他代理购买加载该缓存并跳过预填充的权利。这可行,并且是token精确的:加载预计算的KV并继续与从头开始预填充匹配(24/24个贪婪token,并且在logits级别),没有准确度损失。在Qwen3-4B上,重用比预填充计算便宜9-50倍,并且差距随长度增加而扩大(预填充的注意力与L^2成比例),因此一次重用就足以收回成本。然后关键部分:KV存储在哪里。传输它失败了,因为KV几乎不可压缩,因此每次加载的出口成本比它节省的预填充成本还要高。将其托管在提供方侧,正如生产中的提示缓存那样,完全消除了出口成本。奖励的大小由我们测量的计算节省决定:为80M代理提供一份热门的3774-token文档,重新预填充成本约150万美元,而重用计算成本仅约3万美元(减少49.7倍)。API收取的0.1倍缓存读取关税在测量范围内为用户提供了10倍的折扣,因此10倍是下限,而测量的约50倍计算节省超过了它,与物理约50倍的差距是提供方的利润:每份热门文档数百万美元。我们构建了由此产生的代理原生预填充CDN,并将无损KV压缩和跨方支付层作为开放问题。

英文摘要

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
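The headline arithmetic can be reproduced with a short sketch. The inputs (80M agents, ~$1.5M total re-prefill cost, 49.7x reuse ratio) are taken from the abstract. The per-agent prefill cost is derived from them, and the accounting of one up-front publisher prefill plus one reuse per agent is an assumption of this sketch. `fleet_costs` is a hypothetical helper.

```python
def fleet_costs(n_agents: int, prefill_cost_per_agent: float, reuse_ratio: float):
    """Return (re-prefill total, cached total) in dollars.

    Cached total = one publisher prefill + one cheaper reuse per agent
    (assumed accounting; egress and storage are ignored).
    """
    without_cache = n_agents * prefill_cost_per_agent
    with_cache = prefill_cost_per_agent + without_cache / reuse_ratio
    return without_cache, with_cache


N_AGENTS = 80_000_000
PREFILL_PER_AGENT = 1.5e6 / N_AGENTS   # derived from the ~$1.5M total in the abstract
REUSE_RATIO = 49.7                     # measured compute saving reported in the abstract

without_cache, with_cache = fleet_costs(N_AGENTS, PREFILL_PER_AGENT, REUSE_RATIO)
saving_factor = without_cache / with_cache
```

At this scale the one-time publisher prefill is negligible, so the fleet-level saving is essentially the per-load reuse ratio.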

2606.13355 2026-06-12 cs.RO cs.AI 新提交

Real-Time Execution with Autoregressive Policies

基于自回归策略的实时执行

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

发表机构 * Korea Institute of Science and Technology(韩国科学技术研究院) Seoul National University(首尔大学) Google Research(谷歌研究院)

AI总结 通过异步推理和约束解码实现自回归策略的实时执行,在保证低延迟的同时提升任务完成速度,实验表明其性能优于流匹配策略。

详情
AI中文摘要

实时执行通过异步推理实现平滑动作轨迹和快速响应,对于大规模视觉-语言-动作模型的实际部署至关重要。然而,近期关于实时执行的工作主要关注扩散策略的变体,尽管自回归策略在同步推理中滚动速度较慢,更需要实时性。相比之下,我们证明自回归策略可以通过调整分词范围和应用约束解码来实现实时执行,从而保证严格的延迟界限,支持多轨迹解码以最大化性能。在模拟和真实环境中,我们发现自回归策略始终优于同等水平的流匹配策略,同时显著提升了同步推理的任务完成速度。结合自回归策略的固有优势(如更快的收敛速度和更好的指令遵循泛化能力),这些结果证实自回归策略仍是一种支持实时执行的竞争性策略类型。

英文摘要

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.13352 2026-06-12 cs.RO 新提交

Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications

低成本、易制造、高柔性应变与触觉传感纤维用于机器人应用

Christian Diaz Herrera, Srushti Raste, Simin Liu, Miles Modeste, Jiyang Yin, Katelyn McCall, Yuxing Jared Yao, Roopkamal Chahal, Simon Chidley, Trung Ha, T. David Westmoreland, Sonia Roberts

发表机构 * Wesleyan University(卫斯理大学)

AI总结 提出一种仅用廉价商用部件和工具快速制造的导电纤维,兼具电阻应变传感和电容触觉传感功能,实验验证其在机器人抓取、姿态估计和近场跟踪中的应用。

详情
AI中文摘要

现有的机器人拉伸和触觉传感器通常在材料成本、所需制造设备或制造时间方面至少有一项昂贵。我们提出并实验表征了一种导电纤维,仅使用廉价的商用现成部件(导电线程$0.07/英尺,硅胶管$0.94/英尺)和工具(环形针穿线器$2),可快速制造(20厘米长度2分钟)。我们展示了其作为电阻应变传感器的三种应用:触发气动辅助手指的抓取、感知气动机器人带的位置、以及估计柔性固体的姿态。我们还展示了其作为电容传感器的两种应用:首先,作为触觉传感器触发商业机器人手臂移动;其次,作为近场传感器使机器人手臂跟随移动的手。电容传感器通过编织制成,展示了纤维的高柔性。我们讨论了提高制造可扩展性的方法及其成本权衡。最后,我们展示了一种修复切断纤维的方法。

英文摘要

Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor that triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.

2606.13349 2026-06-12 cs.CL 新提交

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

从被动生成到主动调查:一种主动的科学同行评审代理

Haishuo Fang, Yue Feng, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt(达姆施塔特工业大学通用知识处理实验室) National Research Center for Applied Cybersecurity ATHENE, Germany(德国国家应用网络安全研究中心 ATHENE) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院)

AI总结 提出ProReviewer,一种基于LLM的主动科学同行评审代理,将评审建模为马尔可夫决策过程,通过结构化评审日志引导主动调查,在五个质量维度上平均得分最高,优于现有方法。

详情
AI中文摘要

大型语言模型(LLM)在自动化科学同行评审方面显示出潜力。然而,现有方法通常难以生成有具体证据支持的深入评审。我们认为,一个关键限制是缺乏根据累积证据主动调查论文可疑部分的灵活性,就像人类评审员所做的那样。在本文中,我们探讨如何使基于LLM的评审代理能够进行这种主动调查。我们发现,这可以自然地表述为马尔可夫决策过程(MDP),并提出了ProReviewer,一种科学同行评审代理,它通过维护的结构化评审日志主动评审论文。结构化评审日志作为代理的工作空间,用于跟踪评审过程中收集的证据和中间发现。实验表明,使用8B骨干网络、通过监督微调训练并通过强化学习优化的ProReviewer,在五个质量维度上取得了最高平均分,相对优于基于提示的方法(使用更大的前沿LLM)高达39%,优于最强的微调基线16%。在人工评估中,它也取得了对基线最高的胜率。

英文摘要

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

2606.13348 2026-06-12 cs.CL cs.AI 新提交

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

IVIE:一种用于增量且经过验证的交互式小说世界生成的神经符号方法

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

发表机构 * Instituto de Computación, Facultad de Ingeniería, Universidad de la República(乌拉圭共和国大学工程学院计算机研究所)

AI总结 提出IVIE神经符号方法,结合LLM的创造力与符号验证的连贯性,通过四阶段增量生成管道构建可玩的交互式小说世界,人类评估显示其生成沉浸式、主题连贯的世界,平衡了灵活性与叙事一致性。

Comments 10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC'26), June 2026

详情
AI中文摘要

交互式小说中的计算创造力面临一个基本矛盾:大型语言模型(LLM)可能产生创意叙事,但难以维持世界连贯性,而符号系统确保一致性但缺乏创意灵活性。我们提出IVIE(增量与验证的交互体验),一种从零开始生成完整且可玩的交互式小说世界的神经符号方法。基于PAYADOR的神经符号框架,IVIE实现了一个四阶段增量生成管道,将创意决策——设定与角色创建、谜题设计——委托给LLM,同时通过符号验证将世界状态接地。该系统生成具有相互关联的地点、功能性物品、非玩家角色和连贯谜题的世界,所有这些都围绕一个中心目标导向架构组织。人类评估表明,该方法生成了沉浸式、主题连贯的世界,具有高玩家参与度。结果似乎表明,神经符号方法成功平衡了灵活性与叙事连贯性:符号验证在不消除生成自由的情况下将LLM生成接地。然而,挑战依然存在:LLM的不一致性偶尔会绕过谜题约束,客观验证的空白允许一些结构上不可能的目标。我们为未来的神经符号交互式叙事系统确定了关键设计考虑因素,特别是关于LLM的能力及其局限性。

英文摘要

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions (setting and character creation, puzzle design) to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

2606.13347 2026-06-12 cs.LG 新提交

Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

通过改进的反向扩散采样增强分类器引导扩散模型中的低密度区域探索

Jagriti Singh, Shekhar Verma, Muneendra Ojha

发表机构 * University of Allahabad(阿拉哈巴德大学)

AI总结 提出一种无需额外训练的采样时间密度感知方法,通过修改分类器梯度引导轨迹朝向低置信区域并引导采样朝向预测真实图像,以增强扩散模型对低密度区域的探索。

详情
AI中文摘要

扩散模型已成为高保真图像合成的最先进生成模型,特别是在无分类器引导和分类器引导形式中。然而,标准分类器引导将概率质量集中在高密度类均值周围,导致对类条件分布尾部罕见样本的覆盖不足。最近关于基于扩散的尾部采样的工作通过训练一个额外的低密度寻求分类器(使用合成与真实判别器)来缓解这一问题,但代价是额外的网络和训练。与此同时,许多采样器和蒸馏技术加速或改进扩散采样,但并未明确解决长尾覆盖问题。我们提出一种纯采样时间、密度感知的分类器引导条件扩散模型扩展,针对低密度区域且无需任何额外训练。我们像大多数扩散模型一样,对噪声图像应用引导而非预测噪声。从预训练的ImageNet条件扩散模型和分类器开始,我们通过修改分类器梯度将轨迹引导向低置信区域,并在每个时间步引导采样过程朝向预测的真实图像,从而修改引导反向动力学。第一个引导有助于探索低概率样本,第二个引导有助于生成接近真实数据流形的样本。所提出的采样器在64x64分辨率下一致提高了ADM模型的召回率,同时保持可比的FID,并且使用256x256 ADM模型,我们展示了两种引导不同组合的视觉结果。我们还表明,标准ADM分类器引导结合预测真实图像引导,有助于在ImageNet上使用256x256 ADM模型生成高感知质量的样本。

英文摘要

Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. We apply guidance at noisy images, not on predicted noise like most diffusion models. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via the modified classifier gradient, and at each time step, we also guide the sampling process toward the predicted real image. The first guidance helps explore low-probability samples, and the second helps generate samples close to the real data manifold. The proposed sampler consistently improves ADM model recall at 64x64 resolution while maintaining a comparable FID, and with a 256x256 ADM model, we show visual results for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted real image guidance, helps generate high perceptual quality samples with a 256x256 ADM model on ImageNet.

2606.13345 2026-06-12 cs.CV 新提交

JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

JointEdit3D:统一潜在空间中的前馈3D场景编辑

Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Fudan University(复旦大学) Tencent(腾讯)

AI总结 提出JointEdit3D,在统一RGB-几何重建生成潜在空间中通过非对称潜在修复实现前馈3D场景编辑,引入SceneAnchor分支和编辑/背景感知损失,并构建SceneEdit3D-15K数据集和SceneEdit3D-Bench基准,显著提升编辑区域质量和3D结构完整性。

Comments Preprint. Project page: https://xinnan-zhu.github.io/JointEdit3D-Page/

详情
AI中文摘要

现有的3D场景编辑方法通常依赖于对显式3D表示进行逐场景优化或级联编辑-重建流水线,导致测试时成本高、3D感知有限以及结构不一致。为了在编辑过程中耦合外观合成和几何预测,我们构建了一个统一的RGB-几何重建生成潜在空间,并将其适应于前馈3D场景编辑。由此产生的框架JointEdit3D通过仅观察单个编辑后的RGB参考潜在变量,并在源场景锚定下生成剩余的RGB视图和编辑后的几何潜在变量,执行非对称潜在修复。JointEdit3D引入了一个专门的SceneAnchor分支来注入源场景结构而不强制直接复制,并采用编辑/背景感知损失来平衡编辑区域的保真度与未编辑内容的保持。为了解决缺乏用于标准化3D场景编辑评估的配对资源的问题,我们引入了SceneEdit3D-15K数据集,该数据集包含15K个配对编辑样本和渲染器提供的3D注释,以及SceneEdit3D-Bench,一个精心挑选的100样本基准。实验表明,JointEdit3D在保持竞争性背景保留的同时,在编辑区域质量和3D结构完整性方面优于先前基线。

英文摘要

Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

2606.13340 2026-06-12 cs.RO 新提交

EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection

基于EMG的各向异性虚拟夹具自适应方法用于机器人辅助手术切除与解剖

Dario Onfiani, Michael Dyck, Luigi Biagiotti, Julian Klodmann

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学) German Aerospace Center (DLR)(德国航空航天中心)

AI总结 提出一种基于EMG信号自适应调节各向异性虚拟夹具的框架,通过实时推断外科医生意图动态调整约束,实验证明能提高手术精度和运动一致性,降低认知负荷。

详情
AI中文摘要

本文针对机器人辅助腹腔镜手术中的精细任务(如切除和解剖),开发了一种自适应辅助系统。尽管虚拟夹具在引导外科医生运动方面具有显著优势,但传统虚拟夹具通常由固定几何形状定义,缺乏适应手术流程或外科医生即时意图的灵活性。为解决这些局限性,我们提出了一种自适应各向异性虚拟夹具的新框架。此外,我们引入了一种直观的控制接口,该接口基于从EMG信号推断的外科医生意图,实时调节夹具的几何形状。该方法允许外科医生通过收缩前臂肌肉动态扩展或解除约束,实现精确引导运动和工具自由重新定位之间的无缝切换。基于标准化手术训练任务的初步用户研究实验结果表明了所提方法的有效性。该系统在任务精度和运动一致性方面表现出显著改善,同时降低了感知认知负荷、努力和挫败感。

英文摘要

In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as resection and dissection. Although virtual fixtures offer significant advantages for guiding a surgeon's movements, conventional virtual fixtures are often defined by fixed geometries, lacking the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.

2606.13338 2026-06-12 cs.LG 新提交

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

导航安全-保真度权衡:通过概率场景进行电力系统的大规模多变量时间序列预测

Kaijie Xu, Anqi Wang, Xilin Dai

发表机构 * ZJU-UIUC Institute, Zhejiang University(浙江大学伊利诺伊大学厄巴纳香槟校区联合学院)

AI总结 针对现有基准无法评估大规模多变量概率预测的安全性与保真度权衡问题,提出包含多达36,964个通道的电力系统基准PowerPhase和场景式分位数预测器PowerForge,在多个网格上取得最佳平均排名。

详情
AI中文摘要

概率预测模型越来越多地部署在具有不同通道物理特性和运行约束的多变量系统上,但现有基准无法大规模评估这两个属性。公开的规范多变量基准最多包含2,000个通道,而电力系统基准要么缺乏时间结构,要么缺乏概率评估。我们提出PowerPhase,这是一个基于六个输电网络构建的概率预测基准,联合预测通道数从2,000到36,964,比流行的规范多变量基准高出一个数量级以上。每个目标轨迹是交流潮流求解的输出,PowerPhase配备了约束感知指标,包括Safety_mBrier、NECV和CVaR-alpha,作为CRPS和Distortion的补充。在八个基线和三个随机种子上,分布准确性和约束满足对模型进行不同排序,我们将这种权衡称为安全-保真度。我们进一步提出PowerForge,一种基于场景的分位数预测器,具有类型特定的解码头和变量组之间的因果桥,在每个网格上实现了最佳平均排名。

英文摘要

Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
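Among the constraint-aware metrics listed above, a tail-risk measure such as CVaR can be estimated directly from forecast scenarios. The sketch below uses the generic sample-based definition (the mean of the worst (1 - alpha) fraction of outcomes). How PowerPhase defines CVaR-alpha over channels and horizons is not stated in the abstract, so the function `cvar` and the toy violation values are assumptions.

```python
import numpy as np


def cvar(samples, alpha: float = 0.95) -> float:
    """Empirical CVaR: mean of the samples at or beyond the alpha-quantile position.

    Generic textbook estimator, not necessarily the benchmark's exact definition.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(alpha * len(x)))   # index of the first tail sample
    k = min(k, len(x) - 1)             # always keep at least one sample in the tail
    return float(x[k:].mean())


# Toy per-scenario constraint violations, e.g. max(0, flow - limit) in each scenario.
violations = [1.0, 2.0, 3.0, 4.0]
tail_risk = cvar(violations, alpha=0.5)
```

A forecaster can score well on average accuracy while still having a heavy violation tail, which is the kind of disagreement the safety-fidelity trade-off describes.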

2606.13332 2026-06-12 cs.CV 新提交

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

OR-Action: 细粒度动作的多角色视频理解

Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Carl Zeiss AG(卡尔蔡司股份公司)

AI总结 针对手术室活动理解中场景图方法缺乏时间建模的问题,提出基于公开数据集的细粒度多角色动作基准,并引入纯视觉时序模型,显著优于图方法,同时提出多视角到单视角特征对齐策略提升单视角性能。

详情
AI中文摘要

对手术室活动的细粒度理解能够实现工作流感知的辅助,但由于杂乱、遮挡和有限的感知,仍然困难。建模该环境的主流方法是使用场景图作为OR交互的可解释表示。然而,在没有显式时间建模的情况下,将它们的逐帧关系预测转换为时间上延伸的细粒度动作是具有挑战性的。为了对当前OR理解方法进行原则性的时间评估,我们引入了第一个以动作为中心的基准,该基准基于公开可用的自我中心-外部中心OR数据集,通过定义细粒度的多角色动作分类法,并通过从地面真实场景图状态变化中蒸馏生成密集动作片段。在该基准上的实验表明,当前的场景图预测方法难以建模时间结构,即使通过图神经网络添加显式建模也是如此。因此,我们引入了一种纯视觉时间模型,当使用所有可用的自我中心视频作为输入时,该模型显著优于基于图的方法。在此模型基础上,我们还引入了一种新颖的多视角到单视角特征对齐策略,提高了多角色动作识别的单视角性能,减少了对大量自我中心视频采集的需求。基准和代码将在接收后发布。

英文摘要

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. However, converting their frame-wise relational predictions into temporally extended, fine-grained actions is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

2606.13322 2026-06-12 cs.CL 新提交

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

基于LLM并行文本生成的低延迟实时音频游戏解说系统

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

发表机构 * The University of Tokyo(东京大学) National Institute of Advanced Industrial Science and Technology(产业技术综合研究所) Technical University of Munich(慕尼黑工业大学) Keio University(庆应义塾大学) Carnegie Mellon University(卡内基梅隆大学) Nara Women’s University(奈良女子大学)

AI总结 提出一种并行文本生成与语音播放的低延迟实时游戏解说系统,将平均句间静默从9.6秒降至0.3秒,显著提升解说节奏。

Comments Accepted at IJCAI-ECAI 2026 (Demonstrations Track)

详情
AI中文摘要

我们提出了一种低延迟实时音频游戏解说系统,可直接从实时游戏视频生成语音解说。在这种端到端设置中,关键瓶颈是累积等待时间;传统流程顺序执行帧捕获、文本生成和语音合成,且直到语音播放完成才请求下一次生成。这种严格顺序性导致语句间出现长且不自然的静默。为解决这一延迟瓶颈,我们的系统将文本生成与语音播放并行运行,并预先缓冲多个候选语句,从而在播放边界实现即时合成。在快节奏游戏视频上的实验表明,与顺序基线相比,我们的并行设计将平均句间静默从9.6秒降至0.3秒。它还将与专业演讲的静默时间模式相似度提高了40%以上,一项包含120名经验游戏玩家的用户研究证实,感知到的说话节奏显著改善。我们的演示视频可在以下网址获取:https://youtu.be/pmrRUlvav8M。

英文摘要

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
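The effect of overlapping generation with playback can be seen in a small timing simulation. This is a simplified model, not the authors' system: it assumes a single generator that produces utterances back to back, folds synthesis time into the generation time, and ignores candidate selection among buffered utterances. Both function names are hypothetical.

```python
from typing import List, Sequence


def sequential_silences(gen_times: Sequence[float]) -> List[float]:
    """Sequential baseline: each generation starts only after the previous playback ends,
    so the silence before utterance i equals its own generation time."""
    return list(gen_times[1:])


def parallel_silences(gen_times: Sequence[float],
                      play_times: Sequence[float]) -> List[float]:
    """Parallel design (simplified): generation runs continuously in the background,
    and an utterance plays as soon as it is ready and the previous one has finished."""
    silences: List[float] = []
    ready = 0.0          # time at which the current utterance finishes generating
    prev_end = None      # time at which the previous utterance finished playing
    for gen, play in zip(gen_times, play_times):
        ready += gen
        start = ready if prev_end is None else max(ready, prev_end)
        if prev_end is not None:
            silences.append(start - prev_end)
        prev_end = start + play
    return silences


# Toy timings in seconds (not measurements from the paper).
gen = [2.0, 4.0, 2.0]
play = [3.0, 3.0, 3.0]
seq_gaps = sequential_silences(gen)
par_gaps = parallel_silences(gen, play)
```

In this model a gap remains only when generation takes longer than the playback it overlaps with, which is why buffering candidates ahead of time helps at playback boundaries.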

2606.13317 2026-06-12 cs.CL 新提交

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

SkillCAT: 面向LLM智能体的对比评估与拓扑感知技能自进化

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computer Science, Fudan University(复旦大学计算机学院)

AI总结 提出SkillCAT框架,通过对比因果提取、评估增强进化和拓扑感知任务执行三阶段,实现无需训练的LLM智能体技能自进化,在多个基准上平均提升高达40.40%。

Comments 9 pages, 6 figures

详情
AI中文摘要

LLM智能体的技能自进化方法旨在将执行轨迹转化为可复用的技能文档,但当前流程通常每个任务只学习一条轨迹,在检查前合并候选技能补丁,并在推理前加载完整技能语料库。我们提出SkillCAT,一个无需训练的框架,将该过程分为三个阶段。对比因果提取(CCE)为每个任务采样多条轨迹,并比较同任务的成功/失败对,以识别解释结果差异的证据。评估增强进化(AAE)在源任务克隆上回放每个候选补丁,并在层次化技能补丁合并前仅保留改善或保持任务结果的补丁。拓扑感知任务执行(TTE)将进化后的技能编译成可路由的子技能拓扑,因此推理仅加载与任务相关的能力节点。我们在常见智能体基准上评估SkillCAT,包括SpreadsheetBench、WikiTableQuestions和DocVQA,并进一步测试跨模型和分布外泛化。在这些设置中,SkillCAT将基线平均得分提升高达40.40%,展示了无需模型训练的可靠技能进化。

英文摘要

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.
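The check-before-merge step attributed to AAE above can be sketched as a simple filter. This is an illustration only: `replay` stands in for whatever procedure re-runs a source-task clone with a candidate patch applied and returns a score, and the patch names and scores are invented for the example.

```python
from typing import Callable, Iterable, List


def filter_patches(patches: Iterable[str],
                   baseline_score: float,
                   replay: Callable[[str], float]) -> List[str]:
    """Keep only candidate patches whose replayed score improves on or preserves the baseline.

    `replay` is a hypothetical callable; the paper's replay mechanism is not
    described in the abstract.
    """
    return [p for p in patches if replay(p) >= baseline_score]


# Toy replay results for three hypothetical patches.
toy_scores = {"patch_a": 0.8, "patch_b": 0.5, "patch_c": 0.6}
kept = filter_patches(toy_scores, baseline_score=0.6, replay=toy_scores.__getitem__)
```

Filtering before merging means a harmful patch is dropped on its own evidence rather than being diluted inside a merged skill document.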