arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13030 2026-06-12 cs.CV 新提交

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

一种结合跨主体伪标签与语义对齐的多模态微手势识别框架

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology (HFUT)(合肥工业大学计算机科学与信息工程学院) School of Computer Science, University of Auckland (UOA)(奥克兰大学计算机科学学院)

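AI summary: To address the low signal-to-noise ratio, long-tailed class distribution, and cross-subject domain shift in micro-gesture recognition, the paper proposes a multi-modal framework that combines saliency-guided extraction, square-root smoothed weighting, an orthogonal semantic embedding loss, and a cross-modal pseudo-labeling strategy, reaching an F1-score of 68.13%.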

Comments 14 pages, 2 figures

AI中文摘要

微手势(MGs)是自发的、细微的身体动作,经常传达隐藏的人类情感。在未修剪视频中识别MGs仍然极具挑战性,因为其极低的信噪比、严重的长尾类分布以及跨主体评估场景中固有的域偏移。在本文中,我们为第四届MiGA-IJCAI挑战赛的Track 1提出了一个全面的多模态框架。为了捕捉细粒度表示,我们设计了一个显著性引导的多模态提取流程,整合了68关键点骨架关节坐标、3D热图体积和高分辨率RGB视觉特征。我们引入了一种温和的平方根平滑加权机制,配合正交语义嵌入损失,以保护尾部类别而不损害整体识别能力。更重要的是,为了弥合跨主体泛化差距,我们提出了一种跨模态伪标签(CMPL)策略用于无监督域适应,显著提升了单模态鲁棒性。最后,采用温度缩放软投票机制以减轻后期融合中的过度自信。大量实验表明,我们的框架达到了具有竞争力的68.13%的F1分数,获得第四名。

英文摘要

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing the 4th place.
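
The abstract names two mechanisms without giving formulas: square-root smoothed class weighting and temperature-scaled soft voting. The sketch below shows one common reading of each, as an illustration only. The exact normalization, the temperature value, and the function names `sqrt_smoothed_weights` and `soft_vote` are assumptions, not the authors' implementation.

```python
import math


def sqrt_smoothed_weights(class_counts):
    """Assumed form: weight each class by 1/sqrt(count), rescaled to mean 1.

    Compared with plain inverse-frequency weighting, the square root
    narrows the gap between head and tail classes.
    """
    raw = [1.0 / math.sqrt(c) for c in class_counts]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_model_logits, temperature=2.0):
    """Average the temperature-scaled probabilities of several models."""
    probs = [softmax(logits, temperature) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_models for k in range(n_classes)]


# Hypothetical class counts: a head class, a mid class, and a tail class.
weights = sqrt_smoothed_weights([100, 25, 4])
# Hypothetical logits from two modalities (e.g. skeleton and RGB).
fused = soft_vote([[4.0, 1.0, 0.0], [0.5, 2.0, 0.0]], temperature=2.0)
```

With counts of 100 and 4, inverse-frequency weighting would give a 25:1 weight ratio, while the square-root form gives 5:1, which is the "gentle" behavior the abstract alludes to.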

2606.13024 2026-06-12 cs.LG cs.AI 新提交

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

CausalMoE:基于模式路由异构专家的十亿规模多模态基础模型用于格兰杰因果发现

Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院通用人工智能国家重点实验室) National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University(北京大学健康医疗大数据国家研究院、人工智能研究院)

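AI summary: Proposes CausalMoE, a billion-scale multimodal Granger causal foundation model that decouples dynamic regimes through a pattern-routed mixture of heterogeneous experts and combines causality-aware self-attention with LLM/VLM priors to recover sparse causal graphs, achieving state-of-the-art results in supervised and few-shot settings.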

AI中文摘要

格兰杰因果发现(GCD)是分析复杂系统中时间依赖性的基础。然而,现有的神经GCD方法主要依赖“一刀切”范式,难以捕捉真实世界时间序列中固有的分布偏移和动态机制变化,常导致表示纠缠和虚假因果图。本文提出CausalMoE,一种十亿规模多模态格兰杰因果基础模型,显式建模补丁级异质性。CausalMoE引入模式路由混合异构专家,动态识别潜在时间模式并将补丁路由到专门领域专家,有效解耦机制特定动态与共享动态。为确保可解释的图恢复,我们设计了一种跨变量运行的因果感知自注意力机制,通过近端优化生成稀疏格兰杰因果图。此外,CausalMoE是首个集成LLM和VLM以对齐数值信号与文本和视觉先验的模型,在复杂场景中正则化因果估计。大量实验表明,CausalMoE在全监督基准上达到新最优,同时在传统方法失败的少样本设置中有效泛化。

英文摘要

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.
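
The abstract says sparse Granger causal graphs are obtained "via proximal optimization" but does not state the penalty. As a hedged illustration, the sketch below assumes a plain L1 penalty, whose proximal operator is elementwise soft-thresholding. The function names, the learning rate, and the penalty strength are hypothetical and are not taken from CausalMoE.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def proximal_step(weights, grads, lr, lam):
    """One proximal-gradient update on a matrix of causal edge strengths.

    A gradient step on the smooth loss is followed by soft-thresholding,
    which sets small entries exactly to zero (no Granger-causal edge).
    """
    return [
        [soft_threshold(w - lr * g, lr * lam) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


# Hypothetical 2x2 edge-strength matrix with zero gradients, for clarity.
edges = [[0.5, 0.05], [-0.4, 0.0]]
zero_grads = [[0.0, 0.0], [0.0, 0.0]]
sparse_edges = proximal_step(edges, zero_grads, lr=1.0, lam=0.1)
```

Entries whose magnitude is at most `lr * lam` become exactly zero, which is what makes the recovered graph sparse rather than merely small-valued.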

2606.13022 2026-06-12 cs.CV cs.LG 新提交

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

基于骨架的人体动作识别中保质量不可察觉对抗攻击

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

发表机构 * Durham University(杜伦大学) Tsinghua University(清华大学) Beihang University(北京航空航天大学) Zhongguancun Laboratory(中关村实验室)

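AI summary: Adversarial attacks on skeleton-based action recognition often introduce noise-like perturbations that degrade motion quality. This paper proposes a distribution-based attack that preserves motion quality by minimizing the gap between empirical and true risk, and designs a new metric for evaluating naturalness; experiments show it outperforms existing methods in both attack success rate and motion quality.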

AI中文摘要

针对骨架人体动作识别的对抗攻击已受到广泛关注。然而,现有方法通常引入类似噪声的扰动,导致攻击后动作质量下降,从而在S-HAR系统的最新进展中本质上是可察觉的。我们发现这种退化源于先前对抗攻击优化过程中经验风险与真实风险之间的差距。为解决此问题,我们提出一种在不损害动作质量的情况下获得对抗动作的攻击方法。为最小化风险差距并保持动作质量,我们提出一种基于分布的对抗攻击方法,不引入类似噪声的扰动。为忠实评估动作质量,我们提出一种新指标,该指标与人类对真实世界自然性的感知一致。在两个数据集上对最先进的S-HAR方法进行了实验,通过定性和定量分析证明了我们的方法在攻击成功率和攻击后动作质量方面的优越性。我们的保质量攻击应用和基于分布的方法的成功引发了关于动作识别器鲁棒性的严重担忧,强调了在该领域进一步改进的必要性。

英文摘要

Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

2606.13020 2026-06-12 cs.AI 新提交

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

SciR: 面向LLM科学推理的可控基准

Pierre Beckmann, Marco Valentino, Andre Freitas

发表机构 * Idiap Research Institute(Idiap研究所) EPFL(瑞士联邦理工学院) School of Computer Science, University of Sheffield(谢菲尔德大学计算机科学学院) University of Manchester(曼彻斯特大学) National Biomarker Centre, CRUK Manchester Institute(国家生物标志物中心,CRUK曼彻斯特研究所)

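AI summary: Proposes the SciR benchmark, which generates verifiable multi-paradigm scientific reasoning tasks from formal objects and controls two dimensions, information extraction difficulty and inference difficulty, to expose LLM weaknesses in scientific reasoning.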

AI中文摘要

科学推理中反复出现三种范式的推理形式:演绎、归纳和因果溯因。目前,在科学环境中可靠地评估LLM在这三种推理上的表现尚不可及:基于人工标注的科学基准成本高昂且缺乏机制性真值,而合成逻辑推理基准则不像真实的科学文档。我们引入了SciR,这是一个将多范式推理与可控科学渲染相结合的基准,以三个范式性科学问题为锚点。任务从形式对象(演绎树、归纳规则假设、因果图)生成,以保证可验证答案,然后通过每个轨道的领域调优体裁渲染成多文档科学论述。该构建使我们能够独立变化两个难度轴:提取推理所需关键信息的难度,以及原则性推理本身的难度。我们测试了六个模型。两个轴都对每个模型造成伤害,且其效应叠加。渲染甚至伤害了神经符号管道,后者将推理交给经过验证的求解器。这两个轴产生了每个模型的提取与推理轮廓:例如,像deepseek-r1这样的推理模型在推理轴上大多超过了非推理指令模型。据我们所知,SciR是第一个在提取和推理难度上具有参数化控制的多范式科学推理基准。

英文摘要

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

2606.13016 2026-06-12 cs.AI 新提交

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

Otters++: 一种基于首次脉冲时间的高能效光学脉冲Transformer

Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Qianhui Liu, Weng-Fai Wong

发表机构 * National University of Singapore(新加坡国立大学) Westlake University(西湖大学) Shandong University(山东大学) Zhejiang University(浙江大学) Agency for Science, Research and Technology(新加坡科技研究局)

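AI summary: Proposes Otters++, which uses the natural signal decay of optoelectronic devices to implement TTFS computation and, through layer-wise equivalence and a hybrid training method, reaches an average GLUE score of 84.17% at lower energy consumption.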

AI中文摘要

脉冲神经网络(SNN)有望实现高能效推理,而首次脉冲时间(TTFS)编码尤其吸引人,因为每个神经元最多发放一次脉冲。然而,在实践中,这一优势往往因计算时间衰减项并将其与突触权重相乘的成本而减弱。我们通过将物理硬件“缺陷”——光电器件中的自然信号衰减——转化为TTFS的主要计算来解决这一问题,命名为Otters++。具体来说,我们利用定制In$_2$O$_3$光电突触的实测衰减直接实现TTFS时间项,从而消除了显式数字衰减计算的需求。为了将该思想扩展到Transformer模型,我们建立了Otters++与量化神经网络(QNN)之间的逐层功能等价性,并开发了一种混合训练方法,在前向传播中使用忠实于器件的SNN计算,在后向传播中通过等效QNN路径使用QNN直通梯度,并结合模型蒸馏。这避免了对离散首次脉冲事件的微分,并减少了直接TTFS-SNN训练中的过度稀疏问题。我们进一步通过采样运行间变化使训练感知实测器件噪声,并通过考虑器件共享和多跳通信来细化系统级能耗模型。在GLUE数据集上,Otters++将平均得分提高到84.17%,同时相比先前的脉冲Transformer基线保持明显的能耗优势。这些结果表明,基于物理的TTFS计算在实际硬件效应下可以高效、可训练且鲁棒。

英文摘要

Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue by turning a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS, named Otters++. Specifically, we use the measured decay of a custom In$_2$O$_3$ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE dataset, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

2606.13007 2026-06-12 cs.LG cs.AI 新提交

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

scLLM-DSC:基于LLM知识增强的跨模态深度结构聚类用于单细胞RNA测序

Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of Chinese Academy of Sciences(中国科学院大学) Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) School of Computing and Information Technology, Great Bay University(大湾区大学计算机科学与技术学院) School of Engineering, Westlake University(西湖大学工学院)

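AI summary: Proposes the scLLM-DSC framework, which uses cross-modal contrastive alignment between a knowledge-driven semantic view and a structure-aware topological view to bring LLM knowledge into clustering of single-cell RNA sequencing data, significantly outperforming existing methods.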

AI中文摘要

聚类是scRNA-seq分析的基础,是识别细胞群体和解析组织异质性的基石。然而,现有方法专注于挖掘数值统计模式,由于忽略了基因编码的内在生物学功能,存在语义不可知的问题。虽然大语言模型(LLM)提供了有前景的语义能力,但生成式预训练目标与判别式下游任务之间的结构不匹配阻碍了它们直接适应细胞聚类。为弥合这一差距,我们提出了scLLM-DSC,一种新颖的LLM知识增强跨模态深度结构聚类框架。与数据驱动范式不同,scLLM-DSC通过协同两个视图建立语义基础表示:从NCBI基因先验和上下文化的Cell2Sentence嵌入中提取的知识驱动语义视图,以及通过图引导编码器提取的结构感知拓扑视图。关键的是,我们引入了一种跨模态对比对齐机制,以在统一潜在空间中强制生物学语义与转录组特征之间的一致性。广泛的基准测试表明,scLLM-DSC在聚类准确性上显著优于十一个最先进的基线方法。

英文摘要

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

2606.13006 2026-06-12 cs.SD 新提交

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

Emo-LiPO:基于LLM的文本到语音中细粒度情感强度控制的列表式偏好优化

Yihang Lin, Li Zhou, Congwei Cao, Dongchu Xie, Xiaoxue Gao, Chen Zhang, Haizhou Li

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Agency for Science, Technology and Research(新加坡科技研究局) National University of Singapore(新加坡国立大学) Shenzhen Research Institute of Big Data(深圳市大数据研究院) Shenzhen Loop Area Institute(深圳市环区研究院)

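AI summary: Proposes the Emo-LiPO framework, which models emotion intensity control as a learning-to-rank problem and uses listwise preference optimization to align emotion intensity between text and speech, yielding more faithful and continuous emotional expression and significantly improving emotion accuracy and intensity controllability on the ESD-plus dataset.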

Comments Accepted by IJCAI 2026. Emotional TTS, Preference Optimization, Emotion Intensity Control

AI中文摘要

基于大型语言模型(LLM)的文本到语音(TTS)系统能够实现提示条件的情感控制,但由于文本与语音之间的语义-声学差距,在细粒度情感强度方面存在困难。为了解决这一挑战,我们将LLM-based TTS中的情感强度控制形式化为一个学习排序问题,并提出了Emo-LiPO,一种列表式偏好优化框架,该框架将提示条件的语音生成与文本中表达的相对情感强度对齐。Emo-LiPO在固定文本下显式建模每种情感内的全局强度排序,从而实现更忠实和连续的情感表达。我们进一步构建了ESD-plus,一个具有显式情感强度变化的多说话人数据集,以支持细粒度情感建模和评估。在ESD-plus上的实验表明,与基于监督学习和DPO的LLM TTS基线相比,Emo-LiPO显著提高了情感准确性和强度可控性,特别是在高强度水平上表现尤为突出。

英文摘要

Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
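
The abstract frames intensity control as learning to rank over a whole list of candidates rather than over pairs. To make the listwise idea concrete, the sketch below uses the Plackett-Luce negative log-likelihood (often called ListMLE), a standard listwise ranking loss. It is a stand-in chosen for illustration; the abstract does not say which listwise objective Emo-LiPO uses, and the scores here are hypothetical.

```python
import math


def listmle_loss(scores_in_target_order):
    """Plackett-Luce negative log-likelihood of a target ordering.

    scores_in_target_order[i] is the model score of the item that should
    be ranked i-th (index 0 = highest emotion intensity). The loss is low
    when scores decrease along the target order.
    """
    s = scores_in_target_order
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = max(tail)
        # log-sum-exp over the items not yet placed, computed stably
        log_z = m + math.log(sum(math.exp(x - m) for x in tail))
        loss += log_z - s[i]
    return loss


# Hypothetical scores for three renditions of one transcript, listed
# from the strongest intended intensity to the weakest.
well_ordered = listmle_loss([3.0, 2.0, 1.0])
mis_ordered = listmle_loss([1.0, 2.0, 3.0])
```

A pairwise objective such as DPO compares two candidates at a time, whereas a listwise loss like this one scores the full ordering in a single term, which matches the "global intensity ordering" the abstract describes.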

2606.12997 2026-06-12 cs.LG stat.ML 新提交

Reliability of Probabilistic Emulation of Physical Systems

物理系统概率仿真的可靠性

Sam F. Greenbury, Radka Jersakova, Paolo Conti, Marjan Famili, Christopher Iliffe Sprague, Edwin Brown, Jason D. McEwen

发表机构 * The Alan Turing Institute(艾伦·图灵研究所) Autodesk Research(欧特克研究院) PhysicsX Orbital University of Sheffield(谢菲尔德大学) University College London(伦敦大学学院)

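AI summary: Compares the reliability of generative models and CRPS-trained ensembles for probabilistic emulation of physical systems, finding that CRPS ensembles are better in coverage and inference speed.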

AI中文摘要

目前,生成物理系统概率预测的两种主要方法已经出现:生成模型(如扩散或流匹配)以及注入随机性的确定性模型集成(使用连续排序概率评分(CRPS)损失训练)。虽然这两种方法都表现出强大的预测准确性,但其不确定性的可靠性尚未得到系统评估。我们通过开发一个框架来填补这一空白,该框架在匹配模型大小和计算预算的情况下,评估这两种方法在多种二维时空物理系统中的表现。我们通过检查预测区间的经验覆盖率来评估概率仿真的可靠性,同时考虑准确性和计算效率指标。CRPS训练的集成在单步预测和自回归展开中通常能实现更可靠的不确定性,显示出比在潜在空间中训练生成模型的标准替代方案更好的覆盖率。此外,CRPS方法提供了显著更快的推理速度。当生成模型在环境空间而非压缩潜在空间中训练时(这在高维问题中通常不可行),它们表现出与CRPS训练集成相当的覆盖率,但推理延迟显著更大。相比之下,当CRPS训练的集成在潜在空间中训练时,其覆盖率相对于环境空间没有明显下降。生成模型和CRPS训练的集成都表现出良好的预测准确性。为促进未来的研究和应用,我们发布了AutoCast,一个实现生成模型和CRPS训练集成的模块化框架,以及AutoSim,一个用于快速原型的灵活数据集生成包。

英文摘要

Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
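
Two quantities carry the comparison in this abstract: the CRPS used as a training loss and the empirical coverage of predictive intervals used to judge reliability. The sketch below shows the standard energy-form CRPS estimator for a scalar ensemble and a simple coverage count. It is not taken from AutoCast; the function names and toy numbers are assumptions, and a "fair" variant of the estimator would normalize the spread term differently.

```python
def crps_ensemble(members, y):
    """Energy-form CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    members is a list of ensemble predictions for one scalar target y.
    Lower is better; with a single member it reduces to absolute error.
    """
    n = len(members)
    accuracy = sum(abs(x - y) for x in members) / n
    spread = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return accuracy - spread


def empirical_coverage(intervals, observations):
    """Fraction of observations that fall inside their predictive interval."""
    hits = sum(1 for (lo, hi), y in zip(intervals, observations) if lo <= y <= hi)
    return hits / len(observations)


# Hypothetical two-member ensemble around a true value of 1.0.
score = crps_ensemble([0.0, 2.0], 1.0)

# Hypothetical nominal-90% intervals checked against four observations.
coverage = empirical_coverage(
    [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)],
    [0.5, 1.5, 0.2, 0.9],
)
```

A reliable emulator is one whose empirical coverage is close to the nominal level of the interval, so a nominal 90% interval that covers 75% of observations, as in this toy case, would indicate overconfident uncertainties.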

2606.12995 2026-06-12 cs.RO 新提交

GenHOI: Contact-Aware Humanoid-Object Interaction by Imitating Generated Videos without Task-Specific Training

GenHOI: 通过模仿生成视频实现接触感知的人形机器人-物体交互,无需任务特定训练

Zhihai Bi, Qiang Zhang, Guoyang Zhao, Jiahang Cao, Xueyin Luo, Yushan Zhang, Jinglan Xu, Ruoyu Geng, Yulin Li, Andrew F. Luo, Jun Ma

发表机构 * The University of Tokyo(东京大学) National University of Singapore(新加坡国立大学) University of California, Los Angeles(加州大学洛杉矶分校) Tsinghua University(清华大学)

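AI summary: Proposes the GenHOI framework, which lets a humanoid robot perform diverse object-interaction tasks zero-shot by imitating a single generated video, without task-specific training or physical demonstration data, by encoding contact events and hand-object contact regions as geometric constraints for trajectory optimization.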

AI中文摘要

人形机器人-物体交互(HOI)是人形机器人的基本能力,但由于动态平衡和与多样物体稳定交互之间的紧密耦合,它仍然具有挑战性。现有方法通常需要耗时的任务特定策略训练或依赖于刚性轨迹回放,这限制了它们适应新颖交互场景的能力。在这项工作中,我们提出了GenHOI,一个简单而有效的框架,通过直接模仿单个生成视频,使人形机器人能够以零样本方式执行多样化的物体交互任务,无需任务特定训练或物理演示数据。GenHOI首先在仿真中重建机器人-物体场景并渲染第一帧图像,该图像与语言命令一起条件化任务导向交互视频的合成。然后分析生成的视频以识别交互相关的接触事件并估计手-物体接触区域,这些被编码为以物体为中心的几何约束,将视觉交互线索转化为物理基础的优化先验。在这些先验的指导下,从视频中恢复的参考运动被细化和平滑,以解决2D视频生成中固有的尺度模糊性,同时将单个参考轨迹适应于未见过的机器人-物体相对姿态。优化后的轨迹最终由闭环跟踪控制器执行。我们在包括箱子抓取、非对称双臂椅子搬运、从下方抬桌子和圆柱物体包裹在内的多样化物体交互任务中,通过大量仿真和真实世界实验验证了所提出的框架。

英文摘要

Humanoid-Object Interaction (HOI) is a fundamental capability for humanoid robots, yet it remains challenging due to the tight coupling between dynamic balance and stable interaction with diverse objects. Existing methods often require time-consuming task-specific policy training or rely on rigid trajectory replay, which limits their ability to accommodate novel interaction scenarios. In this work, we present GenHOI, a simple yet effective framework that enables humanoid robots to perform diverse object-interaction tasks in a zero-shot manner by directly imitating a single generated video, without task-specific training or physical demonstration data. GenHOI first reconstructs the robot-object scene in simulation and renders a first-frame image, which, together with the language command, conditions the synthesis of a task-oriented interaction video. The generated video is then analyzed to identify interaction-relevant contact events and estimate hand-object contact regions, which are encoded as object-centric geometric constraints that convert visual interaction cues into physically grounded optimization priors. Guided by these priors, the reference motion recovered from the video is refined and smoothed to resolve the scale ambiguity inherent in 2D video generation, while adapting a single reference trajectory to unseen robot-object relative poses. The optimized trajectory is finally executed by a closed-loop tracking controller. We validate the proposed framework in extensive simulation and real-world experiments across diverse object-interaction tasks, including box grasping, asymmetric bimanual chair carrying, table lifting from below, and cylindrical-object enveloping.

2606.12991 2026-06-12 cs.AI 新提交

APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization

APCyc:通过自动环化实现环肽的性质导向设计

Yifan Zhao, Lang Qin, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) AI-Peptide Drug Design Joint Laboratory(AI-多肽药物设计联合实验室)

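AI summary: Proposes the APCyc framework, which uses an expanded residue vocabulary, explicit encoding of cyclization sites and linkage types, and Bayesian posterior guidance to enable target-aware de novo cyclic peptide design with joint optimization of multiple physicochemical properties.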

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

AI中文摘要

环肽是现代药物发现中一类有前景的治疗化合物,通常具有更好的稳定性和结合亲和力。然而,环肽的从头设计仍然具有挑战性,因为方法必须识别口袋适应的环化模式和连接位点,同时控制药物相关性质。这一挑战对于主要在线性肽数据上训练的生成模型尤为突出,这些模型可能无法捕捉环化特异性约束。为解决这一局限性,我们引入了APCyc,一个目标感知的从头环肽生成框架,该框架显式建模环化并联合优化多种基本理化性质。通过使用扩展的残基词汇表并显式编码环化位点和连接类型信息,APCyc学习环化感知表示,并利用贝叶斯后验引导将采样导向满足多个性质目标的环肽。实验结果表明,我们的模型学习了目标依赖的环化偏好,并实现了环肽设计的有效且可控的多性质优化。本文源代码可在以下网址获取:https://github.com/HKUSTGZ-ML4Health-Lab/APCyc。

英文摘要

Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address the limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences, and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at https://github.com/HKUSTGZ-ML4Health-Lab/APCyc.

2606.12990 2026-06-12 cs.LG 新提交

Exposure Bias as Epistemic Underidentification in Recursive Forecasting

递归预测中的曝光偏差作为认知欠识别问题

Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho

发表机构 * University of Bristol(布里斯托大学)

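AI summary: Shows that exposure bias in recursive multi-step forecasting is not only distribution shift but also an epistemic underidentification problem under partial observability, and proposes an error decomposition and correction method based on provenance variables.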

Comments Accepted for ICML 2026 EIML workshop

AI中文摘要

递归多步预测通常被表述为分布偏移:模型在观测历史数据上训练,但部署于自身预测结果上。我们通过证明在部分可观测性或状态截断下,递归展开也是一个认知欠识别问题,表明这种表述是不完整的。即使具有确定性潜在动力学,一步贝叶斯监督仅在观测上下文中识别行为,一旦展开查询自生成诱导状态(其正确的局部目标不能仅由数值状态确定),则无需识别部署的递归预测器。我们通过诱导状态 $Z$ 和来源变量 $P$ 形式化这一点,并推导出诱导状态误差分解为教师强制/展开不匹配、表示-类别逼近和来源信息差距。实验表明,展开进入一个不同的诱导状态区域,固定诱导状态定义了一个不同的局部校正任务,闭环增益不仅来自局部适应,还来自改变展开期间访问的诱导状态。使用简单的二进制来源编码,来源感知校正可以进一步提高性能,尽管增益是有条件的而非均匀的。这些结果将曝光偏差重新定义为自诱导认知不确定性下的推理。

英文摘要

Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states $Z$ and provenance variables $P$, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation--class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.

2606.12988 2026-06-12 cs.CV cs.AI 新提交

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

一种用于实时个性化人体工学姿态分析的机器学习框架

Manex Atxa, Bruno Simoes, Julen Balzategui

发表机构 * Vicomtech Foundation(Vicomtech基金会) Basque Research and Technology Alliance (BRTA)(巴斯克研究与技术联盟)

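AI summary: Proposes a method for real-time prediction of ergonomic and non-ergonomic poses from three-dimensional volumetric video data, combining multi-angle analysis of 3D point clouds with a personalized deep learning classifier to overcome fixed-viewpoint occlusion and enable real-time assessment.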

Comments 13 pages, 7 figures, conference 24CMH

详情
AI中文摘要

本文介绍了一种利用三维体积视频数据实时预测人体工学和非工学姿态的新方法。尽管该方法是为人体工学评估设计的,但它可以适应其他需要实时分析人体姿态的应用。该系统的一个突出特点是能够在评估过程中分析3D点云,从而实现多角度计算。这克服了相机通常提供固定视角的关键限制,从而限制了全面姿态评估可用的数据,尤其是在发生遮挡时。系统持续自动地对实时流数据使用选定的视角进行姿态推断;然而,只有用户手动选择和标记的姿态用于训练个性化深度学习分类器。该方法通过一个案例研究进行了优化,其中RGB-D相机捕捉了执行负重任务的受试者,实现了实时骨骼标记。模型在此数据上训练,并在训练阶段后对新流数据实时进行推断。本研究通过结合最先进的3D数据技术和传统的2D姿态估计算法,为实时人体工学评估提供了一种可扩展且实用的方法。它解决了工作场所环境中日益增长的安全与健康监测需求,标志着对该领域的显著贡献。

英文摘要

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras which provide often a fixed viewpoint, thereby restricting the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 新提交

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University(斯坦福大学)

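AI summary: Proposes a compact latent world model with a Diffusion Transformer (DiT) for predicting future scenes, achieving a 4.8× better KID on nuScenes and action controllability (steering ρ = 0.81).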

Comments 10 pages, 9 figures, 2 tables

AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景,从而无需真实世界部署即可进行规划和仿真,但在紧凑、可训练的规模下,未来具有模糊性,且该领域的标准失真度量具有误导性:它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题,该模型给定当前前摄像头潜变量和一系列自我动作,预测未来场景潜变量,由冻结解码器渲染为$256 \times 256$帧,最多提前8秒,在150个保留的nuScenes场景上评估。我们首先基准测试预测位置:在跨越四个表示族的六个冻结编码器中,具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer(DiT),并通过受控诊断识别其所需的四个要素:空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中,我们揭示了核心矛盾:失真度量(余弦相似度、SSIM)倾向于模糊均值,掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界:扩散模型达到KID 0.078,而回归为0.375(好4.8倍),且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性(转向驱动场景位移,Spearman $\rho = 0.81$,而回归为$-0.18$)。我们将有限的单次运动归因于共享当前锚点,并设计了一个紧凑的170万参数“跳跃”模型,恢复完整的真实运动幅度($1.02\times$ GT),而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
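
Action controllability is reported here as a Spearman rank correlation between steering and scene displacement, and the perceptual gain as a KID ratio. The sketch below shows how a Spearman coefficient is computed (Pearson correlation of average ranks) and checks the quoted ratio. The steering and displacement values are hypothetical toy data, not the paper's measurements, and the function names are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical steering commands and measured scene displacements.
steering = [-0.4, -0.1, 0.0, 0.2, 0.5]
displacement = [-3.0, -0.5, 0.1, 0.9, 4.2]
rho = spearman(steering, displacement)

# Ratio of the two KID values quoted in the abstract.
kid_ratio = 0.375 / 0.078
```

Because Spearman correlation depends only on ranks, it rewards any monotone relationship between steering and displacement, which suits a controllability check where the response need not be linear.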

2606.12985 2026-06-12 cs.CV New submission

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

物体先于词汇:用于从儿童视角视频中语言接地学习的物体优先归纳偏置

Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Weizmann Institute of Science

AI summary: To resolve the twofold ambiguity of when and where a named referent appears in infant-view video, the paper proposes BabyMind, which uses an object-first inductive bias, a mask-based region interface, and prototype-space multiple-instance contrastive learning to improve language grounding under sparse, weak supervision.

AI Chinese abstract

从自然经验中学习接地词汇含义需要解决婴儿视角记录中的两个歧义:命名参照物何时出现以及在杂乱画面中的位置。在SAYCam风格的数据中,看护者的语言稀疏且与自我中心视频弱同步,因此单帧对比配对会产生噪声正样本,其中目标物体缺失或被干扰物纠缠。我们提出BabyMind,一种在稀疏、噪声监督下用于儿童视角对比学习的物体优先偏置。BabyMind使用离线掩码区域接口提取候选物体嵌入,通过跟踪将短话语中心窗口内的候选物体链接成轻量级物体文件,并使用原型空间多实例对比目标将话语与物体文件袋对齐。轨迹一致性和全局物体一致性正则化器稳定学习,并将物体文件结构转移到评估时使用的全局帧嵌入中。在SAYCam-S上,BabyMind将Labeled-S 15强制选择准确率比CVCL提高了+2.6个点,并在词汇内分布外基准测试中取得一致提升。代码可在 https://github.com/sathiiii/BabyMind 获取。

English abstract

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.

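2606.12984 2026-06-12 cs.CL New submission

SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants

SkillChain: 为基于图像的电商AI助手闭环技能演化

Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng

Affiliations: Alibaba Group

AI summary: Proposes the SkillChain framework, which automates the Skill lifecycle in three stages (Skill creation, route optimization, and body refinement) to resolve multi-intent conflation in e-commerce image assistants, substantially improving response quality and user engagement.

AI Chinese abstract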

基于图像的AI助手现已大规模部署在电商平台上,其中单张上传图像可能触发根本不同的用户意图:产品搜索、风格推荐、视觉百科或实用工具调用,每种意图都需要自己的响应格式、工具调用和领域知识。如果没有按意图的行为约束,基于LLM的系统会混淆这些异构模式,达不到领域质量标准,而意图空间的广度和动态性使得手动工程不可行。为解决这一问题,我们提出了SkillChain,它闭环了技能演化的生产反馈循环,通过三个阶段自动化技能生命周期:用于从任务规范和轨迹中引导启动的技能创建器、用于路由对齐的路由优化器,以及通过双路径LLM-Judge评估进行迭代技能主体精炼的主体精炼器。部署在生产规模的电商图像助手上,SkillChain显著提高了聚合响应质量,在结构合规性和内容质量上提升最大;为期一周的在线A/B实验进一步证实了用户参与度、内容消费和长期留存率的显著提升。

English abstract

Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

2606.12983 2026-06-12 cs.AI New submission

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

面向LLM驱动的硬件描述语言设计与验证数据整理的结构化测试台生成

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin, Yao-Ting Hsieh, Cheng Liang, Hsiang-Yu Tsou, Mu-Chi Chen, Yu-Kai Hung, Shao-Chun Ho, Po-Hsuang Huang, Shih-Hao Hung, H. T. Kung

Affiliations: National Taiwan University; Academia Sinica; Harvard University

AI summary: Proposes STG, a framework that exploits the inherent structure of hardware designs to generate deterministic testbenches; it runs 720x faster than an iterative LLM-based flow, with a higher compilation success rate, higher coverage, and fewer false-pass verdicts, and also serves data curation and test-time scaling.

Comments 9 pages, 10 figures

AI Chinese abstract

自动化测试台生成已成为大型语言模型(LLM)驱动的寄存器传输级(RTL)工作流中的关键瓶颈,其中大量候选设计必须快速可靠地验证。现有的基于提示的方法将测试台生成视为无约束的代码合成,产生随机输出,具有高令牌成本、低可重复性和不足的覆盖率。为了解决这一差距,我们提出了STG,一个结构化测试台生成框架,利用硬件设计的固有结构生成确定性测试台。作为直接验证工具,STG比基于迭代LLM的测试台生成流程快720倍,具有更高的编译成功率,实现更高的覆盖率,并减少对不正确DUT的错误通过判定。STG还通过暴露有缺陷的基准测试台帮助识别RTL生成基准中的错误。作为数据整理引擎,它在单个CPU核心上比基于LLM的过滤快11倍,能耗低127倍,由此得到的蒸馏模型在我们的多基准评估中提供了最先进的性能。作为测试时扩展预言,它减少了14-47%的节点数。我们的模型可在 https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12 获取。

English abstract

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

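2606.12981 2026-06-12 cs.CV New submission

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

用于TUMTraf V2X协同3D目标检测的相机与LiDAR BEV融合

Muhammad Shahbaz, Shaurya Agarwal

Affiliations: Department of Civil, Environmental and Construction Engineering, University of Central Florida

AI summary: Proposes a detector that fuses roadside cameras with an infrastructure-plus-vehicle point cloud in BEV space, using a CenterPoint-style head and IoU re-ranking; it reaches 0.85 mAP on the public test split of the DriveX 2026 challenge and analyzes how overlap between the train/validation and test splits affects the score.

AI Chinese abstract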

我们描述了一种为DriveX 2026挑战赛的TUMTraf V2X协同3D目标检测赛道开发的相机与LiDAR融合检测器。该检测器在共享的鸟瞰视图空间中融合三个路边相机与一个融合的基础设施-车辆点云,并通过带有广义IoU回归损失和IoU质量重排序头的CenterPoint风格头部预测边界框。在提供的训练和验证分割上训练后,模型在公开Codabench测试分割上达到了0.85的3D mAP。在迭代系统时,我们观察到50个测试帧中有44个也出现在已发布的训练(40个)和验证(4个)分割中并带有标签。因此,我们进行了两项额外研究来量化这种重叠对最终分数的影响:(1)一个微调运行,对44个重叠帧进行过采样,达到0.89 mAP;(2)一个后处理运行,将这些帧上的预测替换为已发布的真实值,达到0.99 mAP(上传到我们的Codabench账户进行测试,但未在排行榜上发布)。报告了所有三种配置及其每类结果。

English abstract

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
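The abstract mentions a generalized IoU regression loss. As a minimal sketch, assuming axis-aligned 2D boxes in the BEV plane given as `(x1, y1, x2, y2)`, GIoU and a corresponding loss can be computed as below. The detector's actual boxes are 3D and presumably rotated, which this simplification ignores, and the function names are illustrative rather than taken from the paper.

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch

    # GIoU subtracts the fraction of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(pred, target):
    """Loss in [0, 2]: 0 for a perfect match, larger for distant boxes."""
    return 1.0 - giou(pred, target)
```

Unlike plain IoU, the GIoU term stays informative for non-overlapping boxes: two disjoint boxes that are far apart score lower than two disjoint boxes that nearly touch.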

2606.12979 2026-06-12 cs.LG New submission

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

EPM-JEPA:JEPA系列世界模型中的算子侧经验调制

Vedant Pandya

Affiliations: School of Artificial Intelligence and Data Engineering (SAIDE), Indian Institute of Technology Jodhpur

AI summary: Proposes EPM-JEPA, which modulates the predictor at the weight level via LoRA to handle test-time dynamics shift; experiments show it outperforms a no-memory baseline, though the effect is weaker than expected, and reveal three independent dynamical processes.

Comments 16 pages, 5 figures, 9 tables, 5 code listings. Pre-registered experimental study with mechanism analysis

AI Chinese abstract

JEPA系列世界模型使用静态预测器,其权重在测试时动态偏离训练时不会自适应。我们比较了在分布偏移下将累积经验融入JEPA预测器的两种机制:操作数侧注入(EI-JEPA),将压缩的经验表示作为残差添加到预测器的隐藏状态;以及算子侧调制(EPM-JEPA),通过应用于预测器权重的LoRA生成低秩权重增量。在预注册的比较(Moving MNIST,重力偏移)中,EPM-JEPA(D_shift^{n=50} = 0.7848 +/- 0.0078,三个种子)与EI-JEPA(0.8238)相差delta = 4.74% - 根据我们声明的标准,结果C:零结果 - 是一个有效结果。作为次要的、非预注册的观察,EPM-JEPA在无记忆基线(0.8000)上提高了1.90%,且在所有种子上一致,而EI-JEPA低于基线,表明收益特定于权重级调制。我们的主要贡献是机制分析:D_shift^{n=50}轨迹反映了三个独立的动力学过程——缓冲区循环、EMA目标漂移和内在的LoRA稳定瞬态(+0.021)——而非收敛到平衡。这些发现推动了PEM-JEPA,一个基于物理的后续模型,以解决这一动力学峰值限制。

English abstract

JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA ($D_{\text{shift}}^{n=50} = 0.7848 \pm 0.0078$, three seeds) differs from EI-JEPA (0.8238) by $\delta = 4.74\%$. This is Outcome C, a null result, which by our stated criterion is a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the $D_{\text{shift}}^{n=50}$ trajectory reflects three independent dynamical processes (buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021) rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.
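The contrast between operand-side injection and operator-side modulation can be sketched on a single linear layer. This is an illustrative toy under stated assumptions, not the paper's architecture: the matrices `P`, `A`, `B`, `G` and the gating form `W + B diag(G e) A` are hypothetical choices for how an experience vector `e` might produce a residual or a low-rank weight delta.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def operand_side(W, P, x, e):
    """EI-style: add a projection of e as a residual to the hidden state W x."""
    h = matvec(W, x)
    r = matvec(P, e)
    return [hi + ri for hi, ri in zip(h, r)]


def operator_side(W, A, B, G, x, e):
    """EPM-style: e gates a rank-r update, so the effective weight is
    W + B diag(G e) A (an assumed LoRA-like parameterisation)."""
    gate = matvec(G, e)                        # r gate values derived from e
    low = matvec(A, x)                         # project the input down to rank r
    low = [g * l for g, l in zip(gate, low)]   # apply the experience gate
    delta = matvec(B, low)                     # project back to the output size
    h = matvec(W, x)
    return [hi + di for hi, di in zip(h, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]
P = [[0.5], [0.5]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
G = [[2.0]]
e = [1.0]

print(operand_side(W, P, [1.0, 2.0], e))         # [1.5, 2.5]
print(operator_side(W, A, B, G, [1.0, 2.0], e))  # [7.0, 2.0]
```

One design difference the toy makes visible: the residual in `operand_side` is independent of the input `x`, whereas the correction in `operator_side` scales with `x` because it acts through the weights.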

2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG New submission

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

图像扩散模型的高效、鲁棒且抗共谋指纹识别

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

Affiliations: University of Florence; Shenzhen Campus of Sun Yat-sen University; College of Cyber Security, Jinan University; State Key Laboratory of Internet of Things for Smart City, University of Macau; Department of Computer and Information Science, Faculty of Science and Technology, University of Macau; University of Siena

AI summary: To address the lack of robustness to collusion attacks in fingerprinting for generative text-to-image models, the paper proposes an encoding method based on a personalized normalization module and introduces an anti-collusion mechanism built on lossless function-invariant parameter transformations, achieving high-fidelity, robust fingerprinting that is the first to proactively resist collusion attacks.

AI Chinese abstract

模型指纹识别,即将用户特定标识(指纹)嵌入生成输出中,最近已成为保护生成式文本到图像(T2I)模型知识产权并防止未经授权重新分发的流行解决方案。在这项工作中,我们揭示了现有生成模型指纹识别方法中一个先前未被探索的系统性漏洞:它们缺乏对共谋攻击的鲁棒性,其中多个攻击者结合他们的模型以移除或掩盖指纹。为了解决这个问题,我们迈出了为T2I模型开发具有抗共谋能力的鲁棒指纹识别方法的第一步。所提出的方法将比特串(即指纹)编码到集成到T2I模型中的个性化归一化模块(PNM)的系数中,从而可以从任何生成的图像中可靠地恢复指纹。为了防御共谋攻击并防止未经授权的模型重新分发,我们引入了一种基于无损函数不变参数变换的抗共谋机制。该机制显著降低了共谋模型的图像生成质量,使其实际上无法使用。此外,我们的方法允许开发者通过重新参数化PNM高效地创建多个带指纹的T2I模型副本,而无需重新训练。我们还引入了一种最坏情况优化策略,以提高对模型级攻击的鲁棒性。实验表明,所提出的方法在多个T2I图像生成和编辑任务中实现了高保真度和鲁棒性,指纹提取准确率超过99.5%。与现有方法相比,我们的方法首次通过显著增加共谋模型的FID,展示了对共谋攻击的显著主动鲁棒性。

English abstract

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.
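To illustrate why a function-invariant parameter transformation can defeat collusion, the sketch below uses hidden-unit permutation in a tiny two-layer ReLU network. Permutation is one well-known transformation of this kind; the abstract does not say which transformation the authors use, and modelling collusion as plain weight averaging is likewise an assumption made here for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mlp(W1, W2, x):
    """Two-layer network: W2 * relu(W1 * x)."""
    return matvec(W2, relu(matvec(W1, x)))


def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2.
    The network function is unchanged."""
    W1p = [W1[i] for i in perm]
    W2p = [[row[i] for i in perm] for row in W2]
    return W1p, W2p


def average(Ma, Mb):
    """Element-wise mean of two weight matrices (a naive collusion attack)."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(Ma, Mb)]


W1 = [[1.0, -1.0], [2.0, 0.5]]
W2 = [[1.0, -2.0]]
x = [1.0, 2.0]

W1p, W2p = permute_hidden(W1, W2, [1, 0])
print(mlp(W1, W2, x))    # [-6.0]
print(mlp(W1p, W2p, x))  # [-6.0], same function, different parameters
print(mlp(average(W1, W1p), average(W2, W2p), x))  # [-1.0], averaging breaks it
```

Each distributed copy behaves identically on its own, but parameters that are no longer aligned across copies cannot be meaningfully averaged, which matches the abstract's claim that colluded models degrade.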

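2606.12976 2026-06-12 cs.AI New submission

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

面向协作问题求解与AI推理数据集生成的数学论坛平台

Akbar Erkinov, Nurmukhammad Abdurasulov

Affiliations: Independent Researchers, San Francisco, CA, USA

AI summary: Proposes a forum system with an integrated image-to-LaTeX conversion pipeline that removes the friction of sharing mathematical content, supports desktop and mobile clients, and yields a community-validated dataset of mathematical problems for training AI reasoning.

Comments 11 pages, 3 figures

AI Chinese abstract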

在在线论坛中分享数学内容仍然是学生和教师的一个显著痛点:编写原始LaTeX容易出错,独立的光学字符识别工具需要切换平台,而当前的论坛软件没有提供从公式照片到渲染帖子的集成路径。我们提出了一个统一系统,通过将图像到LaTeX转换管线直接嵌入论坛发布界面来消除这一摩擦。用户上传或拍摄数学表达式的图像;系统通过Mathpix OCR API路由该图像,检测返回的输出是LaTeX还是包含内联数学的纯文本,应用适当的分隔符规范化,并在帖子提交到数据库之前以LaTeX或Markdown模式提供实时预览。该架构分为三个松散耦合的层:图像处理、渲染和存储,并支持桌面和移动客户端。已提交一份涵盖核心方法的美国临时专利申请。我们描述了完整的系统设计、每个组件的细节、数据模式以及关键的技术创新,并将该工作与现有的独立工具和论坛平台进行对比,以展示其填补的实际空白。除了直接的可用性之外,我们认为这种部署的平台构成了一个持续增长、社区验证的数学问题和逐步解决方案数据集,该资源可用于训练和基准测试AI系统以实现准确的数学推理。

English abstract

Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.
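The detection and delimiter-normalisation step might look roughly like the sketch below. It makes no OCR API call and works only on a string that is assumed to have been returned already. The heuristics, the function names, and the choice of `$...$` / `$$...$$` as target delimiters are all hypothetical, since the abstract does not specify the actual rules.

```python
import re


def classify_ocr_output(text):
    """Guess whether the text is one bare LaTeX formula or prose with inline math.
    Heuristic: LaTeX commands or sub/superscripts with no math delimiters
    means a bare formula; anything else is treated as Markdown."""
    s = text.strip()
    has_delims = re.search(r"\\\(|\\\[|\$", s) is not None
    has_commands = re.search(r"\\[A-Za-z]+|[\^_]", s) is not None
    return "latex" if has_commands and not has_delims else "markdown"


def normalise_delimiters(text):
    r"""Rewrite \[...\] as $$...$$ and \(...\) as $...$."""
    text = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", text, flags=re.S)
    text = re.sub(r"\\\((.+?)\\\)", r"$\1$", text, flags=re.S)
    return text


def prepare_preview(text):
    """Return (mode, body) ready to hand to a renderer."""
    mode = classify_ocr_output(text)
    if mode == "latex":
        return mode, "$$" + text.strip() + "$$"
    return mode, normalise_delimiters(text)
```

For example, `prepare_preview(r"Solve \( x^2 = 4 \) for x.")` is classified as Markdown and has its inline delimiters rewritten, while a bare `\frac{a}{b}` is wrapped as a display formula.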

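2606.12971 2026-06-12 cs.LG New submission

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

从二元对话中的语音和交互动态预测认知负荷

Tahiya Chowdhury

Affiliations: Department of Computer Science, Colby College

AI summary: Studies the prediction of perceived cognitive load from speech and interaction-dynamics features in natural collaborative conversations, finding that conversational interaction (such as turn-taking) effectively predicts cognitive-load dimensions such as time pressure and mental work.

Comments Accepted to Interspeech 2026

AI Chinese abstract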

从语音估计认知负荷主要在受控实验室环境中研究,对其在自然协作对话中的可靠性了解有限。我们研究语音和交互动态是否能预测二元对话中的感知认知负荷。我们分析了53对执行九项协作任务的对话音频,提取静态声学、动态和交互特征,训练双头门控循环单元编码器预测认知负荷分数。结果表明,对话交互为预测与时间压力、脑力工作、努力和任务表现相关的认知负荷提供了有用信号。时间需求与话轮转换动态(如重叠和说话者切换)相关,而脑力需求与说话者之间的不平衡参与相关。这些发现强调了任务结构和对话交互在自然协作环境中建模认知负荷的重要性。

English abstract

Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.
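The interaction features named in the abstract (overlap, speaker switches, imbalanced participation) can be computed from diarised speech segments. The sketch below assumes segments of the form `(speaker, start_seconds, end_seconds)`; the exact feature definitions are illustrative guesses, not the paper's.

```python
def interaction_features(segments):
    """Compute simple turn-taking features from (speaker, start, end) segments."""
    segs = sorted(segments, key=lambda s: s[1])

    # A switch is counted whenever consecutive segments have different speakers.
    switches = sum(1 for prev, cur in zip(segs, segs[1:]) if prev[0] != cur[0])

    # Total time during which two different speakers talk simultaneously.
    overlap = 0.0
    for i, (spk_a, a0, a1) in enumerate(segs):
        for spk_b, b0, b1 in segs[i + 1:]:
            if spk_a != spk_b:
                overlap += max(0.0, min(a1, b1) - max(a0, b0))

    # Participation imbalance: gap between the most and least talkative
    # speaker, as a fraction of total speaking time.
    talk = {}
    for spk, t0, t1 in segs:
        talk[spk] = talk.get(spk, 0.0) + (t1 - t0)
    total = sum(talk.values())
    times = sorted(talk.values())
    imbalance = (times[-1] - times[0]) / total if total > 0 and len(times) > 1 else 0.0

    return {
        "speaker_switches": switches,
        "overlap_seconds": overlap,
        "participation_imbalance": imbalance,
    }
```

With segments `[("A", 0.0, 4.0), ("B", 3.0, 6.0), ("A", 6.0, 7.0)]`, the function reports two speaker switches, one second of overlap, and an imbalance of 0.25. Features like these would then be fed, alongside acoustic features, to a sequence model such as the GRU encoder the abstract describes.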

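2606.12969 2026-06-12 cs.AI New submission

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

用于配电缺陷检测的多模态智能体:基础模型评估

Quan Quan

Affiliations: Quan Quan

AI summary: Proposes a multi-modal agent framework and systematically evaluates foundation models on perception, reasoning, and tool use, targeting closed-loop automation of power distribution defect detection.

AI Chinese abstract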

配电网络对可靠电力输送至关重要,但传统检测方法在语义理解、泛化和闭环自动化方面存在局限。为解决这些挑战,本文提出了一种专门用于配电缺陷检测的多模态智能体框架。本研究的核心是系统评估多模态基础模型作为统一认知引擎的能力。我们严格评估了它们在三个关键能力上的综合表现:(1)感知,模型必须准确识别设备并生成专家级的缺陷描述;(2)推理,模型根据视觉发现解释原因、评估严重性并基于领域知识规划维护策略;(3)工具使用,模型作为自主操作者执行动作——如查询知识库或生成工单——以实现闭环维护。为支持此评估,我们开发了领域特定的评估数据集和综合基准。实验结果表明了当前基础模型在这三个维度的优势与局限,为在高风险工业环境中部署自主智能体提供了实证依据。

English abstract

The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions -- such as querying knowledge bases or generating work orders -- to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

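2606.12966 2026-06-12 cs.LG cs.NE New submission

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

电路同步先于泛化:来自Grokking Transformer中傅里叶结构的因果证据

Achyuthan Sivasankar

Affiliations: New York University

AI summary: Proposes the Frequency Synchronization Degree (FSD) metric, finds that on modular arithmetic tasks it synchronizes 500-3,000 steps before grokking, and provides causal evidence, through weight-decay interventions, that the gap between the two is a regularization phenomenon.

Comments 16 pages, 6 figures, 10 tables

AI Chinese abstract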

Grokking——模算术上的transformer从近乎随机突然转变为近乎完美的验证准确率——归因于傅里叶电路,但其时机、因果结构和可控性仍知之甚少。我们引入了频率同步度(FSD),一种无需先验电路知识的归一化、置换检验的傅里叶电路同步度量。在九个模加法配置(素数p∈{53,71,97,113,131},三个种子)中,FSD在grokking前500-3000步同步(平均领先+1722步;所有九个为正,符号检验p≈0.004),并且在所有九个案例中先于受限logit损失基线(Nanda等人的排除损失),使其成为最早可用的预测器。我们提供了直接因果证据,证明相间间隙是一种正则化现象:在FSD峰值步骤分叉训练并变化权重衰减λ,会产生严格单调的更早grokking,且Δ_t与1/λ成正比。该定律在三个素数(p∈{53,97,131};两个干净案例的R²=1.00和R²=0.99)上重复,表示为Δ_t ~ C/λ,与(1/λ)*log(||W_mem||/τ)一致。架构消融实验表明,仅注意力模型在强FSD前兆下grok;仅MLP模型从不grok;单层模型的FSD滞后,确认了前兆是多块电路属性。

English abstract

Grokking -- where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy -- is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes $p \in \{53, 71, 97, 113, 131\}$, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test $p \approx 0.004$), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay $\lambda$ produces strictly monotone earlier grokking, with $\Delta_t \propto 1/\lambda$. This law replicates across three primes ($p \in \{53, 97, 131\}$; $R^2 = 1.00$ and $R^2 = 0.99$ for two clean cases), captured as $\Delta_t \sim C/\lambda$, consistent with $(1/\lambda)\log(\lVert W_{\text{mem}} \rVert / \tau)$. Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
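The inverse law between the grokking delay and weight decay can be checked with a one-parameter least-squares fit. The sketch below fits `delay = C / lambda` as a line through the origin in `1/lambda` and reports a coefficient of determination. The function name is illustrative and the numbers in the usage note are synthetic, generated to follow the law exactly; they are not the paper's measurements.

```python
def fit_inverse_law(lambdas, delays):
    """Least-squares estimate of C in delay = C / lambda, plus R^2."""
    xs = [1.0 / lam for lam in lambdas]

    # Closed-form slope for a regression line constrained through the origin.
    C = sum(x * y for x, y in zip(xs, delays)) / sum(x * x for x in xs)

    # R^2 measured against the mean of the observed delays.
    mean_y = sum(delays) / len(delays)
    ss_res = sum((y - C * x) ** 2 for x, y in zip(xs, delays))
    ss_tot = sum((y - mean_y) ** 2 for y in delays)
    return C, 1.0 - ss_res / ss_tot
```

Calling `fit_inverse_law([0.5, 1.0, 2.0], [2000.0, 1000.0, 500.0])` on this synthetic data recovers `C = 1000` with `R^2 = 1`, because doubling the weight decay halves the delay.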

2606.12965 2026-06-12 cs.RO New submission

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

EmbodiSteer: 用关节空间引导的具身无关视觉运动策略实现零样本跨具身部署

Shihefeng Wang, Kangchen Lv, Mingrui Yu, Xiang Li

Affiliations: Department of Automation, Tsinghua University; Beijing Key Laboratory of Embodied Intelligence Systems; Institute for Embodied Intelligence and Robotics, Tsinghua University

AI summary: Proposes the EmbodiSteer framework, which lifts inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates and adds whole-body collision-aware guidance, enabling zero-shot, embodiment-aware deployment that markedly lowers collision rates and raises task success rates on simulated and physical robots.

Comments The first two authors contribute equally

AI Chinese abstract

可扩展的机器人模仿学习依赖于来自不同机器人的大规模异构数据或无身体数据,使得笛卡尔末端执行器动作成为具身无关策略学习的关键接口。然而,仅末端执行器的抽象使得笛卡尔策略对部署的机器人身体无感知,导致其在全身碰撞避免等机器人特定约束下脆弱。为克服这一限制,我们提出EmbodiSteer,一种无需训练的框架,将具身无关的视觉运动策略引导至零样本、具身感知的部署。EmbodiSteer将策略学习保持在笛卡尔空间,同时通过前向运动学和基于雅可比的更新,高效地将推理时的扩散采样提升到目标机器人的关节空间。在每个去噪步骤后,通过关节轨迹上的全身碰撞感知引导,机械臂可以在保持学习到的末端执行器行为的同时避开碰撞。与仅笛卡尔执行相比,EmbodiSteer在9个模拟机器人上将碰撞率降低46.1%,任务成功率提高28.5%,并在高度受限场景下的两个物理机器人上实现碰撞率降低90.0%,成功率提高36.7%。我们的项目页面位于 https://frankwang67.github.io/EmbodiSteer-Page 。

English abstract

Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at https://frankwang67.github.io/EmbodiSteer-Page.
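A Jacobian-based update that maps a Cartesian end-effector error into joint space can be sketched on a planar two-link arm. This is a generic damped least-squares step under assumed link lengths, not the EmbodiSteer algorithm itself: the diffusion sampling and the whole-body collision guidance described in the abstract are not modelled, and all names are illustrative.

```python
import math

L1, L2 = 1.0, 1.0  # link lengths of a hypothetical planar two-link arm


def fk(q):
    """Forward kinematics: joint angles (q1, q2) to end-effector (x, y)."""
    q1, q2 = q
    return (L1 * math.cos(q1) + L2 * math.cos(q1 + q2),
            L1 * math.sin(q1) + L2 * math.sin(q1 + q2))


def jacobian(q):
    """2x2 Jacobian of fk with respect to (q1, q2)."""
    q1, q2 = q
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def joint_step(q, target, damping=1e-3, gain=0.5):
    """One damped least-squares step: dq = J^T (J J^T + damping^2 I)^-1 e."""
    x, y = fk(q)
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    # Entries of the symmetric 2x2 matrix J J^T + damping^2 I.
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve the 2x2 system for u, then map back through J^T.
    ux = (d * ex - b * ey) / det
    uy = (-b * ex + a * ey) / det
    dq1 = J[0][0] * ux + J[1][0] * uy
    dq2 = J[0][1] * ux + J[1][1] * uy
    return [q[0] + gain * dq1, q[1] + gain * dq2]


def reach(q, target, steps=100):
    """Iterate joint_step so the end-effector approaches the target."""
    for _ in range(steps):
        q = joint_step(q, target)
    return q
```

Starting from `q = [0.3, 0.5]`, `reach(q, (1.2, 0.8))` drives the end-effector to the target. In a joint-space guidance scheme, an extra term pushing joints away from obstacles could be added to each step, which is what keeps the end-effector behaviour while changing the arm's posture.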

2606.12958 2026-06-12 cs.CV New submission

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

YOLO-AMC:一种改进的带有注意力机制的YOLO架构用于建筑裂缝检测

Ching-Yu Tsai, Chia-Min Lin, Chih-Hsiang Yang, Yung-Che Wang, Jen-Shiun Chiang

Affiliations: Department of Electrical and Computer Engineering, Tamkang University

AI summary: Proposes YOLO-AMC, which removes C2PSA from YOLOv11 and introduces attention mechanisms such as GAM, Res-CBAM, and SA to improve crack detection, reaching mAP@0.5 of 0.9917 on the test set at 110.95 FPS and balancing accuracy with deployment efficiency.

Comments 14 pages, 8 tables, 6 figures. Expanded version of IET ICETA 2025 conference paper

AI Chinese abstract

裂缝检测在基础设施检查和结构健康监测(SHM)中起着重要作用。然而,裂缝通常表现为薄、低对比度的结构,且容易受到背景噪声的影响,给现有目标检测模型带来了挑战。本研究提出了一种改进的基于YOLO的架构,集成了注意力机制,称为YOLO-AMC(用于裂缝检测的YOLO注意力机制),以增强自动裂缝检测性能。基于YOLOv11,移除了原始的C2PSA模块,并在Neck的多尺度特征融合层中引入了多种注意力机制,包括全局注意力机制(GAM)、残差卷积块注意力模块(Res-CBAM)和Shuffle Attention(SA),以加强跨尺度特征整合。实验结果表明,YOLO-AMC在多个评估指标上始终优于基线模型YOLOv11n和YOLOv8n。在评估的注意力模块中,GAM取得了最佳检测性能,在测试数据集上获得了mAP@0.5 = 0.9917和mAP@0.5:0.95 = 0.9506,高于YOLOv11(0.9833 / 0.9112)和YOLOv8(0.9707 / 0.8921)。此外,在保持7.6 GFLOPs计算复杂度的同时,所提出的模型在NVIDIA RTX 4090平台上达到了110.95 FPS,在Raspberry Pi 5边缘设备上约为5 FPS,展示了准确性与部署效率之间的良好权衡。本研究的实现代码可在GitHub上获取,网址为:https://github.com/CY-Tsai24/YOLO-AMC 。

English abstract

Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at https://github.com/CY-Tsai24/YOLO-AMC.

2606.12956 2026-06-12 cs.RO New submission

SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

SERF:面向长时域移动操作任务的时空环境与机器人特征地图

Sunghwan Kim, Byeonghyun Pak, Kehan Long, Yulun Tian, Nikolay Atanasov

Affiliations: UC San Diego; Agency for Defense Development; SceniX Inc.; University of Michigan

AI summary: Proposes the SERF map, which represents the environment and the robot body as neural points in a shared latent space, is updated online, and serves as a state input to a VLA model; it improves reasoning in long-horizon mobile manipulation and outperforms image-only baselines on BEHAVIOR-1K.

Comments Project page: https://existentialrobotics.org/serf/

AI Chinese abstract

长时域机器人移动操作需要对定位、环境变化和任务进度进行持续推理,而这些都难以仅从图像观测中推断。在本文中,我们表明,将移动操作策略条件化于一个时空特征地图可以改善长时域上的推理。该地图将环境和铰接机器人身体表示为共享潜空间中的神经点,并从自我中心观测和本体感受状态在线更新。我们使用基于对象的刚性跟踪更新环境神经点,并使用正向运动学更新机器人神经点。通过从多个参考帧和空间尺度提取地图标记,我们将时空环境与机器人特征(SERF)地图作为状态输入到视觉-语言-动作(VLA)模型中,为策略提供局部和全局上下文。我们在BEHAVIOR-1K(一个家庭环境中的长时域移动操作基准)上展示了SERF。实验表明,SERF VLA策略优于纯图像基线,通过遵循更直接的轨迹更快地达到子目标,提高了对场景配置变化的鲁棒性,并能从物体掉落失败中恢复。

English abstract

Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.

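2606.12954 2026-06-12 cs.RO New submission

Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025

面向杂乱环境中的可靠顺序物体抓取:RGMC 2025 亚军方案

Wei Yu, Xidan Zhang, Ziyi Zheng, Weijie Kong, Huixu Dong

Affiliations: School of Mechanical Engineering, Zhejiang University

AI summary: For sequential object picking in cluttered environments, proposes an integrated hardware-software pipeline that combines a multifunctional gripper design with novel representations of object distribution and occlusion relationships, enabling efficient recognition, search, and sequential grasping and placing second at RGMC 2025.

Comments First, Second and Third Coauthor contributed equally to this work

AI Chinese abstract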

作为机器人操作中的长期挑战,在杂乱环境中稳定高效地抓取在工业场景中至关重要。尽管近期研究在杂乱抓取中取得了较高的成功率,但对于顺序物体搜索与分类等更具挑战性的任务,成熟解决方案仍然较少。本工作基于杂乱环境抓取基准(CEPB)解决杂乱环境中的顺序物体抓取问题,并展示了我们在ICRA 2025第十届机器人抓取与操作竞赛(RGMC)的“杂乱抓取”赛道中的方案。该任务提出了几个关键挑战。首先,它需要鲁棒且考虑碰撞的抓取,在包括刚性和可变形物体在内的多样化物体集上具有高成功率。其次,它要求高效搜索目标物体,这对方案的清理和搜索策略提出了严格要求。为应对上述挑战,我们设计了一个集成的硬件-软件流水线,结合了物体识别、清理和多模态抓取。主要贡献包括多功能夹爪的硬件设计以及杂乱空间中物体分布和遮挡关系的新表示。该流水线实现了对杂乱环境中物体的高效识别、搜索和顺序抓取,在实验室测试和竞赛场景中均表现出色,最终在RGMC 2025的“杂乱抓取”赛道中获得第二名。

English abstract

As a long-standing challenge in robotic manipulation, stable and efficient grasping in cluttered environments is of great importance in industrial settings. While recent studies have achieved relatively high success rates in grasping from clutter, there remain few mature solutions for more demanding tasks such as sequential object search and sorting. This work addresses sequential object picking in cluttered environments based on the Cluttered Environment Picking Benchmark (CEPB) and presents our solution to the Pick-in-Clutter track of the 10th Robotic Grasping and Manipulation Competition (RGMC) at ICRA 2025. The task poses several key challenges. First, it requires robust and collision-aware grasping with high success rates across a diverse set of objects, including both rigid and deformable ones. Second, it demands efficient search for target objects, which places stringent requirements on the decluttering and searching strategies of the solution. To address the above challenges, we design an integrated hardware-software pipeline that combines object recognition, decluttering, and multi-modal grasping. The main contributions include the hardware design of a multifunctional gripper and novel representations for object distribution and occlusion relationships in cluttered space. This pipeline enables efficient recognition, search, and sequential grasping of objects in clutter, demonstrating strong performance in both laboratory tests and competition scenarios, and ultimately achieving second place in the Pick-in-Clutter track of the RGMC 2025.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 新提交

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ:面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

AI总结 提出OpenMedQ,在14个数据集(约335万样本)上预训练医学视觉语言模型,在PathVQA上BLEU-1达75.9,超越562B参数的Med-PaLM M,并在8个未见医学分类任务上取得最高平均macro-F1(0.757)。

Comments Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

Abstract

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants of up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.
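
For readers unfamiliar with the classification metric quoted above: macro-F1 is the unweighted mean of per-class F1 scores, so a rare class counts as much as a common one. The following is a minimal sketch in plain Python, assuming the standard definition. The function name and the toy labels are illustrative and are not taken from the OpenMedQ code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard definition assumed)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): class "a" is three times as common as class "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case class "a" has F1 = 0.8 and class "b" has F1 = 2/3, so the macro average falls below the plain accuracy of 0.75. That is the usual reason to report macro-F1 on imbalanced medical datasets.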

2606.12945 2026-06-12 cs.AI New submission

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

Zhibao Chen, Qian Cheng

Affiliations Huatai Securities; OneBeget.com

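AI summary To address memory management for long-running LLM agents, the paper proposes a multi-factor memory value function grounded in cognitive psychology. Its weights are learned by gradient-free optimization, and the resulting score jointly controls encoding depth, forget risk, and retrieval rank. On LongMemEval it significantly outperforms single-factor and recency strategies.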

Comments 11 pages, 3 figures

Abstract

Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency, both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function V(m) = Σ_i w_i f_i(m) over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at ≈ 0.98; this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains 0.770 ± 0.011 of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable: reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
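
The value function and the budgeted forgetting decision described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not name its gradient-free optimiser, so plain random search stands in for it here. The factor scores, the budget, and every function name below are hypothetical.

```python
import random

# The seven factors named in the abstract, in a fixed (assumed) order.
FACTORS = [
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
]


def value(weights, features):
    """V(m) = sum_i w_i * f_i(m) for one memory's factor scores."""
    return sum(w * f for w, f in zip(weights, features))


def retain(memories, weights, budget):
    """Keep the `budget` highest-value memories and return their indices."""
    ranked = sorted(range(len(memories)),
                    key=lambda i: value(weights, memories[i]), reverse=True)
    return set(ranked[:budget])


def gold_retention(memories, gold, weights, budget):
    """Fraction of gold-evidence memories that survive forgetting."""
    return len(retain(memories, weights, budget) & gold) / len(gold)


def random_search(memories, gold, budget, trials=200, seed=0):
    """Stand-in gradient-free optimiser: sample weights, keep the best."""
    rng = random.Random(seed)
    best_w = [1.0] * len(FACTORS)  # start from uniform weights
    best_score = gold_retention(memories, gold, best_w, budget)
    for _ in range(trials):
        w = [rng.uniform(0.0, 1.0) for _ in FACTORS]
        score = gold_retention(memories, gold, w, budget)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Toy data (made up): gold memories 0 and 2 score high on reliability only,
# while distractors 1 and 3 score high on most of the other factors.
memories = [
    [0.1, 0.0, 0.1, 0.2, 0.1, 0.9, 0.1],
    [0.2, 0.9, 0.8, 0.1, 0.7, 0.1, 0.6],
    [0.1, 0.1, 0.0, 0.1, 0.2, 0.8, 0.0],
    [0.3, 0.8, 0.7, 0.2, 0.6, 0.2, 0.5],
]
gold = {0, 2}
uniform_score = gold_retention(memories, gold, [1.0] * len(FACTORS), budget=2)
best_w, best_score = random_search(memories, gold, budget=2)
print(uniform_score, best_score)
```

The toy data is a small planted confound in the spirit of the synthetic task mentioned in the abstract: uniform weights keep only the distractors, whereas any weighting that favours reliability keeps both gold memories. Because the search starts from the uniform weights and only accepts improvements, its result is never worse than the uniform baseline.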

2606.12942 2026-06-12 cs.AI New submission

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

Hao Jiang, Xin Li, Annan Wang, Zhi Yang, Haoxiang Zhang, Yichi Zhang, Weisi Lin

Affiliations Nanyang Technological University; Peking University; Independent Researcher

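AI summary To address parse collapse in generative listwise ranking under long-context multimodal settings, the paper proposes PRISMR, a framework that replaces transient in-context list processing with parametric structural conditioning. A lightweight hypernetwork encodes candidates in parallel and generates LoRA weights, substantially reducing parse collapse and improving ranking performance.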

Abstract

Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.
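
As a rough illustration of the idea of generating item-specific LoRA weights and merging them into one instance-specific adapter, the sketch below uses plain Python lists to stay dependency-free. Everything in it is an assumption made for illustration. The abstract does not say how the hypernetwork is built or how the item-level weights are synthesized, so two linear heads and a simple average of the low-rank updates B_i A_i stand in. All names and dimensions are hypothetical.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]


def random_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialised matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def reshape(flat, rows, cols):
    """Cut a flat list into `rows` rows of `cols` entries each."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def item_lora(embedding, to_A, to_B, d_out, d_in, rank):
    """Assumed hypernetwork heads: candidate embedding -> (A_i, B_i)."""
    flat_A = matmul([embedding], to_A)[0]  # length rank * d_in
    flat_B = matmul([embedding], to_B)[0]  # length d_out * rank
    return reshape(flat_A, rank, d_in), reshape(flat_B, d_out, rank)


def instance_adapter(candidates, to_A, to_B, d_out, d_in, rank):
    """Average the per-item updates B_i @ A_i into one d_out x d_in delta."""
    delta = [[0.0] * d_in for _ in range(d_out)]
    for emb in candidates:  # items are independent, so this could be batched
        A, B = item_lora(emb, to_A, to_B, d_out, d_in, rank)
        update = matmul(B, A)
        for r in range(d_out):
            for c in range(d_in):
                delta[r][c] += update[r][c] / len(candidates)
    return delta


def adapted_weight(W_base, delta):
    """Base weight plus the instance-specific update; the base is not mutated."""
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_base, delta)]


rng = random.Random(0)
embed_dim, d_out, d_in, rank = 4, 3, 5, 2
to_A = random_matrix(embed_dim, rank * d_in, rng)
to_B = random_matrix(embed_dim, d_out * rank, rng)
candidates = random_matrix(6, embed_dim, rng, scale=1.0)  # 6 list items
W_base = random_matrix(d_out, d_in, rng, scale=1.0)
delta = instance_adapter(candidates, to_A, to_B, d_out, d_in, rank)
W_eff = adapted_weight(W_base, delta)
```

One property of this particular merging choice is that the adapter does not depend on the order of the candidates, and its size does not grow with the list length. Whether PRISMR's actual synthesis step shares these properties is not stated in the abstract.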