arXivDaily: arXiv daily academic digest, updated Monday through Friday
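2606.14250 2026-06-15 cs.RO New submission

SyLink Hand: A Synergy-Inspired Linkage-Driven Anthropomorphic Hand for Human-Like Dexterity

Hao Wu, Yanzhe Wang, Yu Feng, Yitong Li, Jingxiang Guo, Jian Liu, Jianshu Zhou

Affiliations: National University of Singapore; Zhejiang University

AI summary: Inspired by human hand synergies, the paper proposes the SyLink Hand, an anthropomorphic dexterous hand that combines biomechanical synergy principles with a linkage-driven mechanism to achieve a high degree of anthropomorphism in appearance, kinematics, and function within a compact, low-cost architecture. The results support the claim that a synergy-inspired linkage design balances anthropomorphism, mechanical simplicity, and functional versatility.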

Abstract

Designing anthropomorphic robotic hands that balance functional dexterity with mechanical simplicity remains a significant challenge. Inspired by human hand synergies, this paper presents the SyLink Hand, an anthropomorphic dexterous hand that integrates biomechanical synergy principles with linkage-driven transmission mechanisms to achieve a high degree of anthropomorphism in appearance, kinematics, and functionality within a compact and cost-effective architecture. Biomechanical analysis of natural hand motions using motion capture gloves reveals strong kinematic correlations among hand joints, providing the basis for a simplified yet functional degree-of-freedom (DOF) configuration. Guided by these synergistic characteristics, optimized linkage mechanisms are employed to coordinate multiple joint motions and reproduce natural finger trajectories. A novel spherical four-bar linkage is further proposed to achieve decoupled flexion/extension (Flex/Ext) and abduction/adduction (Abd/Add) at the metacarpophalangeal joint within a compact form factor. The resulting prototype integrates 19 joints driven by 11 actuators, with a total mass of 520g and a manufacturing cost of approximately USD 400. Experimental evaluations demonstrate its human-like kinematic performance, high load-bearing capability, and versatile grasping and manipulation skills. These results validate that the synergy-inspired, linkage-based design effectively balances anthropomorphism, mechanical simplicity, and functional versatility, highlighting its potential for practical deployment in dexterity-demanding robotic applications.
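The abstract's observation that joint angles are strongly correlated is what lets one actuator drive several joints through a linkage. A minimal sketch of how such a coupling could be quantified is shown here. The joint names, the synthetic angle data, and the 0.7 coupling ratio are illustrative assumptions, not values from the paper.

```python
import math

def fit_coupling(driver, follower):
    """Return (Pearson correlation, least-squares slope, intercept) for two joint-angle series."""
    n = len(driver)
    mean_d = sum(driver) / n
    mean_f = sum(follower) / n
    cov = sum((d - mean_d) * (f - mean_f) for d, f in zip(driver, follower))
    var_d = sum((d - mean_d) ** 2 for d in driver)
    var_f = sum((f - mean_f) ** 2 for f in follower)
    corr = cov / math.sqrt(var_d * var_f)
    slope = cov / var_d
    intercept = mean_f - slope * mean_d
    return corr, slope, intercept

# Hypothetical flexion angles in degrees: a driving joint (PIP) and a follower joint (DIP).
pip_angles = [float(a) for a in range(0, 100, 10)]
# Follower tracks the driver at an assumed 0.7 ratio, with a small alternating disturbance.
dip_angles = [0.7 * a + (1.0 if i % 2 == 0 else -1.0) for i, a in enumerate(pip_angles)]

corr, slope, intercept = fit_coupling(pip_angles, dip_angles)
```

A correlation close to 1 with a stable slope suggests the follower joint could be driven by a fixed-ratio linkage instead of its own actuator, which is the general idea behind reducing 19 joints to 11 actuators.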

2606.14249 2026-06-15 cs.AI New submission

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan

Affiliations: Darwin Agent Team

AI summary: Proposes HarnessX, which composes harness primitives via a substitution algebra, adapts them with AEGIS, a multi-agent evolution engine, and closes the loop with trajectory feedback, yielding an average gain of 14.5% across five benchmarks.

Abstract

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke scaffolding, and the rich traces produced during execution are rarely distilled back into systematic improvement. We introduce HarnessX, a foundry for composable, adaptive, and evolvable agent harnesses. HarnessX assembles typed harness primitives via a substitution algebra, adapts them through AEGIS, a trace-driven multi-agent evolution engine grounded in an operational mirror between symbolic adaptation and reinforcement learning, and closes the harness-model loop by turning trajectories into both harness updates and model training signal. Across five benchmarks (ALFWorld, GAIA, WebShop, tau^3-Bench, and SWE-bench Verified), HarnessX yields an average gain of +14.5% (up to +44.0%), with gains largest where baselines are lowest. These results suggest that agent progress need not come from model scaling alone: composing and evolving runtime interfaces from execution feedback is an actionable and complementary lever. The complete codebase will be open-sourced in a future release.

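2606.14245 2026-06-15 cs.LG New submission

Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability

Ali Vefghi, Zahed Rahmati, Mohammad Akbari

Affiliations: Amirkabir University of Technology

AI summary: Audits the interpretability of the BridgeDPI model using gradient-based attribution, feature ablation, and related methods, revealing modality dominance, padding artifacts, and chemistry-consistent fragments, and providing testable hypotheses for computational drug discovery.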

Abstract

Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability audit of the BridgeDPI architecture on three datasets: Gao, Human, and C. elegans. This study combines gradient-based attributions -- integrated gradients, saliency, layer-wise relevance propagation, SmoothGrad, and SmoothGrad-IG -- with feature-wise occlusion ablation and strict intersection consensus across methods to reduce single-explainer bias. We summarize sensitivity and signed effects at raw inputs, at the bridge similarity scaffold, and through the graph convolution, including edge-level sensitivities and targeted edge removals. The results show that explainability is most informative when treated as model criticism: it reveals modality dominance, padding and special-token artifacts, dataset-dependent cooperative versus suppressive effects across layers, and chemistry-consistent fragment and composition motifs where methods agree. These analyses do not substitute for structural or experimental ground truth, yet they can provide testable hypotheses for downstream validation in computational drug discovery pipelines. More broadly, applying modern XAI to contemporary DTI/DTA models is still an early pass over the rich structure implicit in trained weights and data -- yet even this first layer of scrutiny already helps researchers relate predictions to drug- and target-side representations and to prioritize external validation.
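Of the attribution methods listed in the abstract, integrated gradients is easy to illustrate on a toy model. The sketch below is a rough illustration only: `model` and `grad` are hypothetical stand-ins for a trained predictor and its gradient, not BridgeDPI, and the all-zeros baseline is an assumption.

```python
import math

def model(x):
    """Toy differentiable scorer: sigmoid of a weighted sum with one interaction term."""
    z = 1.5 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[2]
    return 1.0 / (1.0 + math.exp(-z))

def grad(x):
    """Analytic gradient of `model` with respect to each input feature."""
    s = model(x)
    ds = s * (1.0 - s)
    return [ds * (1.5 + 0.5 * x[2]), ds * -2.0, ds * 0.5 * x[0]]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the gradient path integral from baseline to x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference input
attributions = integrated_gradients(x, baseline)
# Completeness property: attributions should sum to model(x) - model(baseline).
gap = abs(sum(attributions) - (model(x) - model(baseline)))
```

The completeness check is a useful sanity test before comparing explainers: if the gap is not close to zero, the attributions are dominated by integration error and not by the model.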

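2606.14243 2026-06-15 cs.CL New submission

Decoupled Mixture-of-Experts for Parametric Knowledge Injection

Baoqing Yue, Weihang Su, Qingyao Ai, Yichen Tang, Changyue Wang, Jiacheng Kang, Jingtao Zhan, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes the Decoupled Mixture-of-Experts (DMoE) architecture, which converts external knowledge into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts when the base model lacks knowledge, achieving parameter-level knowledge augmentation and outperforming retrieval and adapter baselines on knowledge-intensive benchmarks.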

Abstract

Knowledge injection aims to equip large language models (LLMs) with external, domain-specific, or time-sensitive knowledge. Existing approaches typically face a trade-off between flexibility and integration: retrieval-augmented generation keeps knowledge outside the model but only provides prompt-level augmentation, whereas post-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates. In this paper, we propose Decoupled Mixture-of-Experts (DMoE), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation. To support efficient auto-regressive inference, DMoE attaches experts only to the final-layer feed-forward network, preserving KV-cache reuse while enabling parameter-level knowledge augmentation. Experiments on knowledge-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter-based baselines.
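The abstract does not say how the router measures uncertainty. One plausible reading, sketched below, gates on the entropy of the base model's next-token distribution. The entropy criterion, the threshold, and the relevance scores are assumptions made for illustration, not the paper's design.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(next_token_probs, expert_scores, threshold=1.0):
    """Return the index of the expert to activate, or None to rely on the base model alone.

    next_token_probs: base-model distribution over candidate tokens (hypothetical input).
    expert_scores: relevance score per expert (hypothetical input).
    """
    if entropy(next_token_probs) < threshold:
        return None  # base model is confident, so no expert is activated
    return max(range(len(expert_scores)), key=expert_scores.__getitem__)

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
scores = [0.1, 0.7, 0.2]

choice_when_confident = route(confident, scores)
choice_when_uncertain = route(uncertain, scores)
```

Gating this way keeps experts idle on tokens the base model already predicts confidently, which matches the abstract's description of activating experts only when knowledge is lacking.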

2606.14240 2026-06-15 cs.AI New submission

AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

Yifan Jiang, Meige Yang, Zitong Li, Jay Pujara

Affiliations: Information Sciences Institute, University of Southern California; University of Southern California

AI summary: Proposes the Affordance20Q benchmark, which uses a 20-Questions game to evaluate how well models reason about object affordances from physical properties; finds a gap of about 20 points between LLMs and humans, and develops KARI, which improves open-source models by up to 15.2 points.

Abstract

Affordance reasoning, the inference of an object's action possibilities from its physical properties (e.g., shape and material), is fundamental to human physical understanding and increasingly critical for Large Language Models (LLMs). However, existing affordance benchmarks largely expose explicit object identities in the evaluation setup, allowing models to rely on memorized object-affordance mappings rather than reasoning over physical properties. To address this gap, we introduce Affordance20Q, a novel affordance reasoning benchmark formulated as a 20-Questions game without exposing the object's identity. In each game, the model identifies a hidden object's affordance from a candidate set by asking yes/no questions about its physical properties. Affordance20Q comprises 1,009 games over 454 objects and 59 affordances, all manually filtered, refined, and annotated. We conduct comprehensive experiments with 15 state-of-the-art LLMs and find a substantial gap (~20 points) compared to human performance. A KL-based information-gain (IG) analysis further shows that models fail to ask discriminating questions as the game progresses. To close the gap, we develop KB-Anchored Rule Induction (KARI), a pipeline based on LLMs that generates affordance rules grounded in evidence from knowledge bases (KBs). KARI improves open-source LLMs by up to 15.2 points, while the limited coverage of KBs hinders further gains. We release all our code and data at https://github.com/1171-jpg/Affordance20Q.git
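The KL-based information-gain analysis mentioned in the abstract can be illustrated for a single yes/no question. The sketch below computes the expected KL divergence from the prior over candidates to the posterior after hearing the answer. The candidate affordances and the answer tables are hypothetical, and the paper's exact estimator may differ.

```python
import math

def expected_information_gain(prior, answers):
    """Expected KL(posterior || prior) in bits for one yes/no question.

    prior: dict mapping candidate -> probability.
    answers: dict mapping candidate -> True/False, the answer that candidate implies.
    """
    gain = 0.0
    for outcome in (True, False):
        consistent = [c for c in prior if answers[c] == outcome and prior[c] > 0.0]
        mass = sum(prior[c] for c in consistent)
        if mass == 0.0:
            continue  # this answer can never occur
        # Posterior is the prior restricted to consistent candidates, renormalised.
        kl = sum((prior[c] / mass) * math.log2((prior[c] / mass) / prior[c]) for c in consistent)
        gain += mass * kl
    return gain

# Hypothetical candidate affordances with a uniform prior.
prior = {"cut": 0.25, "pour": 0.25, "contain": 0.25, "roll": 0.25}
# "Does the object have a sharp edge?" splits nothing evenly here: only "cut" says yes.
sharp_edge = {"cut": True, "pour": False, "contain": False, "roll": False}
# "Is the object concave?" splits the candidates in half.
concave = {"cut": False, "pour": True, "contain": True, "roll": False}
# "Is the object a physical thing?" is answered yes by every candidate.
physical = {c: True for c in prior}

ig_concave = expected_information_gain(prior, concave)
ig_sharp = expected_information_gain(prior, sharp_edge)
ig_physical = expected_information_gain(prior, physical)
```

A question that every remaining candidate answers the same way carries zero gain, which is the failure mode the abstract describes when models stop asking discriminating questions late in a game.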

2606.14239 2026-06-15 cs.AI New submission

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

Haowen Gao, Haoran Chen, Can Wang, Shasha Guo, Liang Pang, Zhaoyang Liu, Huawei Shen, Xueqi Cheng

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Tongyi Lab, Alibaba Group

AI summary: Proposes the SkillAudit framework, which evolves agent skills without ground-truth feedback through paired trajectory auditing and Process-Aligned Contrastive Evaluation, reaching 73.9% average task reward across 89 tasks.

Comments: 20 pages, 5 figures

Abstract

Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards -- signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

2606.14237 2026-06-15 cs.RO New submission

BIM-Loc: BIM-Integrated Discrepancy-Aware LiDAR-based Indoor Localization

Yinqiang Zhang, Liang Lu, Yipeng Pan, Maolin Lei, Yuhan Xie, Zhanteng Xie, Xiaowei Luo, Jia Pan

Affiliations: Department of AI & Data Science, University of Hong Kong (HKU); Department of Architecture and Civil Engineering, City University of Hong Kong (CityU); Humanoids and Human Centered Mechatronics Research Line, Italian Institute of Technology (IIT)

AI summary: Proposes BIM-Loc, a discrepancy-aware LiDAR localization method that directly integrates Building Information Models (BIM). Using multi-hit ray casting, pose graph optimization with BIM-integrated factors, and hierarchical Bayesian inference, it estimates trajectories aligned with the BIM coordinate system and detects discrepancies online, significantly improving localization accuracy and robustness.

Comments: 24 pages, 21 figures, accepted by International Journal of Robotics Research (IJRR), to be published

Abstract

Accurate and robust localization is a fundamental requirement for service and inspection robots, particularly in feature-sparse indoor environments where traditional systems struggle due to a lack of distinct landmarks. While prior maps can enhance robustness, precise and compact maps capturing real-world details are often unavailable for new or frequently changing environments. This paper presents BIM-Loc, a novel discrepancy-aware LiDAR-based localization method that directly integrates Building Information Models (BIM) from the design phase. BIM-Loc simultaneously estimates trajectories aligned with the BIM coordinate system and identifies discrepancies between real-world observations and the as-designed BIM in an online fashion. Our core contributions include: (1) a novel multi-hit ray casting strategy for efficient BIM-point data association and projection of 3D observations into 2D texture space; (2) a pose graph optimization framework with BIM-integrated factors that enforces consistency among odometry, sequential scans, and BIM structures; and (3) a hierarchical Bayesian inference module that incrementally updates a continuous 2D surface representation for discrepancy detection, propagating updates from the pixel to the structure level. Extensive evaluations in both simulation and real-world applications demonstrate that BIM-Loc significantly outperforms state-of-the-art map-based methods in localization accuracy and robustness.

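2606.14235 2026-06-15 cs.LG New submission

Implicit Variational Rejection Sampling

Jian Xu, Shigui Li, Wei Chen, Jiacheng Li, Zhiqi Lin, Delu Zeng, Xinghao Ding, John Paisley, Qibin Zhao

Affiliations: RIKEN iTHEMS; RIKEN AIP; South China University of Technology; Xiamen University; Columbia University

AI summary: Proposes Implicit Variational Rejection Sampling (IVRS), which combines implicit distributions with rejection sampling: a neural network constructs the proposal distribution and a discriminator estimates the density ratio to improve the posterior approximation. Introduces IR-ELBO as a quality measure; experiments show gains over traditional variational inference.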

Abstract

Variational Inference (VI) is a fundamental inference technique in Bayesian machine learning for approximating complex posterior distributions. Traditional VI often relies on the mean-field factorization, which can inadequately capture true posterior complexity. Recent advancements have leveraged neural networks to model implicit distributions, offering increased flexibility. However, the practical constraints of neural network architectures still produce inaccuracies. In this paper, we propose a method called Implicit Variational Rejection Sampling (IVRS), which integrates implicit distributions with rejection sampling to improve the posterior approximation. Our method uses neural networks to construct implicit proposal distributions and refines the approximation through rejection sampling with a discriminator network that estimates the density ratio between the implicit proposal and the true posterior. To this end, we introduce the Implicit Resampling Evidence Lower Bound (IR-ELBO) as a metric to characterize the resampled distribution's quality and derive a tighter variational lower bound. Experimental results demonstrate that our method outperforms traditional variational inference techniques.
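The rejection step described in the abstract can be illustrated with a toy problem where the density ratio is known exactly. In the sketch below, `density_ratio` stands in for the discriminator's estimate; the uniform proposal, the target density 2x on [0, 1], and the bound of 2 are assumptions chosen so the result can be checked by hand.

```python
import random

def density_ratio(x):
    """Stand-in for a learned discriminator: exact target/proposal ratio.

    Target density is p(x) = 2x on [0, 1]; proposal is Uniform(0, 1) with density 1.
    """
    return 2.0 * x

def rejection_sample(n_proposals, ratio_bound=2.0, seed=0):
    """Draw from the proposal and keep each draw with probability ratio / ratio_bound."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        x = rng.random()  # sample from the proposal
        if rng.random() < density_ratio(x) / ratio_bound:
            accepted.append(x)
    return accepted

samples = rejection_sample(20000)
mean = sum(samples) / len(samples)
acceptance_rate = len(samples) / 20000
```

The accepted samples follow the target, whose mean is 2/3, while the raw proposal has mean 1/2. In the learned setting the ratio is only estimated, so the resampled distribution is an approximation whose quality depends on the discriminator.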

2606.14230 2026-06-15 cs.CV cs.CL New submission

A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

Affiliations: School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan; Department of Communication, Quality Management and Information Systems, Mid Sweden University, Ostersund Campus, Sweden; Department of Computer Science, Electrical and Space Engineering, Lulea University of Technology, Luleå, Sweden

AI summary: Proposes SGFF-Net, which fuses spatial, gradient, and DWT-based frequency representations in a dual residual learning architecture for cross-generator and cross-paradigm deepfake detection, raising cross-model accuracy to 79.80%.

Abstract

Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on deepfakes generated by Generative Adversarial Networks (GANs), they often struggle with images generated by recent diffusion models. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.
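The DWT-based frequency branch can be pictured with a one-level 2D Haar transform, the simplest wavelet. The sketch below is a pure-Python illustration under that assumption; the abstract does not state which wavelet or how many levels SGFF-Net uses.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image.

    image: list of rows of numbers. Returns four half-size sub-bands:
    (approximation, column-difference detail, row-difference detail, diagonal detail).
    """
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, len(image), 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            tl, tr = image[r][c], image[r][c + 1]          # top-left, top-right
            bl, br = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2.0)  # local average (low-pass both ways)
            c_row.append((tl - tr + bl - br) / 2.0)  # left minus right columns
            r_row.append((tl + tr - bl - br) / 2.0)  # top minus bottom rows
            d_row.append((tl - tr - bl + br) / 2.0)  # diagonal contrast
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, col_diff, row_diff, diag

bands = haar_dwt2([[1, 2], [3, 4]])
flat_bands = haar_dwt2([[7, 7], [7, 7]])
```

Smooth regions produce near-zero detail bands, so the detail coefficients isolate high-frequency structure, which is the kind of signal a frequency-domain forensic branch would consume.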

2606.14222 2026-06-15 cs.LG New submission

Learning the Context of Errors: Black-Box Online Adaptation of Time Series Foundation Models

Xilin Dai, Yiding Liu, Hongjie Xia, Yifan Hu, Zewei Dong, Jiang-Ming Yang, Qiang Xu

Affiliations: Ant International; The Chinese University of Hong Kong

AI summary: Addresses black-box online adaptation of time series foundation models with ORCA, which adapts by learning the context (input and output) of the base model's prediction errors; validated on 5 models and 8 datasets.

Abstract

The rapid evolution of Time Series Foundation Models (TSFMs) has advanced zero-shot forecasting across diverse domains. Inspired by the current form of Large Language Models, future TSFMs may be offered as commercialized, closed-source API services. However, many existing online adaptation methods still rely on white-box access for parameter fine-tuning or gradient backpropagation. This paradigm mismatch raises a question: In black-box online adaptation for TSFMs, what should we learn? We answer this with an insight: the predictive errors of the base model are conditioned on both the input and output of the base model (i.e., the context of errors). To validate this insight, we propose ORCA (Online Residual Contextual Adaptation). We conduct extensive experiments across 5 state-of-the-art TSFMs and 8 datasets to demonstrate the effectiveness of our approach. Furthermore, through ablation studies, we quantitatively analyze the impact of different adapter learning hypotheses on the final adaptation performance in black-box online adaptation. Code available at https://github.com/Fifthky/ORCA.
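The core insight, that the base model's error depends on both its input and its output, can be sketched with a small online residual corrector. This is not ORCA itself: the biased `base_forecast`, the linear adapter, and the normalized LMS update are all assumptions chosen to keep the example short.

```python
import math

def base_forecast(last_value):
    """Stand-in for a black-box forecaster with a systematic bias (assumption)."""
    return 0.8 * last_value

def run(steps=300, mu=0.5, eps=1e-8):
    """Learn a correction to the base forecast online; return (base MAE, adapted MAE)."""
    w = [0.0, 0.0, 0.0]
    base_err, adapted_err = [], []
    prev = 5.0
    for t in range(1, steps):
        truth = 5.0 + math.sin(0.3 * t)
        b = base_forecast(prev)          # only the output of the black box is used
        phi = [prev, b, 1.0]             # context of the error: input, base output, bias
        correction = sum(wi * pi for wi, pi in zip(w, phi))
        residual = truth - b             # base model's error, known once truth arrives
        err = residual - correction      # error left after applying the correction
        base_err.append(abs(residual))
        adapted_err.append(abs(err))
        norm = eps + sum(p * p for p in phi)
        w = [wi + mu * err * pi / norm for wi, pi in zip(w, phi)]  # normalized LMS step
        prev = truth
    half = len(base_err) // 2            # score the second half, after warm-up
    n = len(base_err) - half
    return sum(base_err[half:]) / n, sum(adapted_err[half:]) / n

base_mae, adapted_mae = run()
```

The adapter never touches the forecaster's parameters or gradients; it only observes inputs, outputs, and realized values, which is what makes the setting black-box.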

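2606.14219 2026-06-15 cs.RO cs.AI New submission

Selective Agentic Recovery for UAV Autonomy with a Persistent Mission Runtime

Taewoo Park, Kyeonghyun Yoo, Seunghyun Yoo, Hwangnam Kim

Affiliations: Department of Electrical and Electronic Engineering, Korea University

AI summary: Proposes the Persistent Mission Runtime (PMR) framework, which performs UAV recovery by selectively invoking an external agentic reasoner, and introduces the learned Cognitive Value of Invocation (learned-CVI) gate. On a Gazebo/PX4 benchmark, it raises hard/ambiguous-scenario success from 5.0% to 95.0% while reducing remote calls by 16.7% and logged tokens by 29.2% relative to a rule-based invocation baseline.

Comments: 17 pages, 2 figures. Preprint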

Abstract

Agentic AI can support unmanned aerial vehicle (UAV) autonomy by providing high-level recovery reasoning when local waypoint- or setpoint-based execution encounters blocked passages, repeated no-progress behavior, or mission-level ambiguity. On physical UAVs, however, remote reasoning is most useful when it is invoked selectively, since each call introduces latency, resource cost, backend uncertainty, and a need to validate the returned decision. This paper presents Persistent Mission Runtime (PMR), a UAV recovery framework that keeps the mission loop and safety-critical execution local while using an external agentic reasoner only as an on-demand recovery module. The reasoner selects from predefined recovery skills, and each returned decision is parsed, verified, safety-filtered, and mapped to local executor actions before it can affect flight. PMR introduces learned Cognitive Value of Invocation (learned-CVI), a compact admission gate that estimates when remote agentic reasoning is likely to improve near-term mission progress enough to justify its operational cost. Across a fixed 400-run Gazebo/PX4 benchmark with eight scenarios, learned-CVI raises hard/ambiguous-regime success from 5.0% under local-only autonomy to 95.0%, outperforms one-shot and periodic reasoning baselines by 20.0 and 32.5 percentage points, and reduces remote-agent calls by 16.7% and logged tokens by 29.2% relative to a manually tuned rule-based invocation baseline.

2606.14218 2026-06-15 cs.RO cs.AI cs.LG New submission

Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback

Litian Liang, Jingxi Xu, Xinda Qi, Yujun Cai, Houzhu Ding, Luqi Wang, Zhixin Sun, Jyh-Herng Chow, Ming Yang, Mark Cutkosky

Affiliations: Ant Group; Stanford University

AI summary: Proposes the Universal Manipulation Exoskeleton (UME), which uses real-time haptic torque feedback and whole-arm data collection to let robots learn active compliant policies, completing tasks such as mobile manipulation and force-mediated box flipping in constrained spaces.

Abstract

For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at https://ume-exo.github.io.

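2606.14217 2026-06-15 cs.LG q-bio.BM New submission

Curvature-Informed Potential Energy Surface for Protein-Ligand Binding Affinity Prediction

Peng-Fei Sun, Chuan-Xian Ren, Hong Yan

Affiliations: Sun Yat-Sen University; City University of Hong Kong

AI summary: Proposes CPES, a curvature-informed potential energy surface graph neural network that models conformational flexibility with physics-informed curvature representations and captures binding-induced dynamic changes with spectral cross-attention, improving affinity prediction.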

Abstract

Accurate prediction of protein-ligand binding affinity is essential for structure-based drug discovery. Recent geometric deep learning methods have achieved promising performance by representing protein-ligand complexes as three-dimensional graphs. However, most existing approaches mainly rely on static interaction geometry from a single bound conformation, while neglecting molecular flexibility and binding-induced conformational changes. To address this limitation, we propose a curvature-informed potential energy surface (CPES) graph neural network for protein-ligand binding affinity prediction, which incorporates physics-informed curvature representations to model conformational flexibility. CPES first derives curvature spectral descriptors from the Hessian of the potential energy surface evaluated at equilibrium configurations, whose eigenvalues define the local principal curvatures of the potential energy surface. It then uses spectral cross-attention to compare the unbound ligand and protein with the bound complex, thereby capturing binding-induced changes in conformational dynamics. In parallel, hierarchical protein-ligand interaction representations are learned from static structural features through geometry-aware message passing, soft clustering, and bidirectional cross-attention. Finally, CPES fuses the curvature-informed dynamic representations with static interaction representations for affinity regression. Extensive evaluations on multiple benchmark datasets demonstrate that CPES achieves improved predictive performance and offers physical interpretability.
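The statement that the Hessian's eigenvalues at equilibrium give the local principal curvatures can be checked on a two-dimensional toy potential. In the sketch below, the quadratic `potential` and its stiffness constants are illustrative assumptions; real potential energy surfaces have far more coordinates and need a general eigensolver.

```python
import math

def potential(x, y, k1=2.0, k2=1.0, c=0.0):
    """Toy quadratic energy surface with its minimum (equilibrium) at the origin."""
    return 0.5 * (k1 * x * x + k2 * y * y) + c * x * y

def hessian(f, x, y, h=1e-3):
    """Central finite-difference second derivatives (fxx, fxy, fyy) of f at (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy

def principal_curvatures(fxx, fxy, fyy):
    """Eigenvalues of the symmetric 2x2 Hessian [[fxx, fxy], [fxy, fyy]], smallest first."""
    mean = 0.5 * (fxx + fyy)
    radius = math.hypot(0.5 * (fxx - fyy), fxy)
    return mean - radius, mean + radius

# Uncoupled case: curvatures equal the two stiffness constants.
soft, stiff = principal_curvatures(*hessian(potential, 0.0, 0.0))
# Coupled case: the cross term mixes the coordinates and shifts both eigenvalues.
soft_c, stiff_c = principal_curvatures(*hessian(lambda x, y: potential(x, y, c=0.5), 0.0, 0.0))
```

Small eigenvalues mark soft directions along which the structure deforms easily, so a spectrum of this kind is one way to encode flexibility in a descriptor.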

2606.14216 2026-06-15 cs.RO cs.SY eess.SY New submission

Short-Horizon Position Accuracy of Single-Track Models: Implications for Motion Planning of Autonomous Vehicles

Aron J. Aertssen, Lars A. T. H. van Alen, Igo J. M. Besselink, Rudolf G. M. Huisman, René M. J. G. van de Molengraft

Affiliations: Department of Mechanical Engineering, Eindhoven University of Technology; Safety & Driver Controls Group, Vehicle Development, DAF Trucks N.V.

AI summary: Compares the short-horizon positional accuracy of three single-track vehicle models against real-vehicle measurements and analyzes the trade-offs between model complexity, parameterization quality, and positional accuracy to inform model selection in model predictive control.

Comments: Submitted to The International Journal of Automotive Engineering, Official Journal of the Society of Automotive Engineers of Japan, Inc. (JSAE)

English abstract

Accurate and computationally efficient vehicle models are essential for motion planning of autonomous vehicles, where positional accuracy directly affects trajectory feasibility and safety. However, the positional accuracy has not been systematically evaluated against real measurements. Therefore, this paper compares the short-horizon positional accuracy of three single-track vehicle models against vehicle measurements across various driving maneuvers. Model parameters are identified through dedicated experiments with the instrumented test vehicle. Rather than identifying a single best model, this work aims to provide insight into the trade-offs between model complexity, parameterization quality, and positional accuracy for informed model selection in Model Predictive Control applications.
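As a point of reference for what "short-horizon positional accuracy" measures, the sketch below rolls out the simplest member of the single-track family and compares its end position with a measured one. The abstract does not say which three models are compared, so this is an assumption: a kinematic single-track model referenced at the rear axle and integrated with forward Euler. The names `rollout_kinematic_single_track`, `position_error`, `wheelbase`, and `dt` are illustrative and not taken from the paper.

```python
import math


def rollout_kinematic_single_track(x, y, yaw, speed, steer, wheelbase, dt, steps):
    """Forward-Euler rollout of a kinematic single-track model (rear-axle reference).

    Assumes constant speed [m/s] and steering angle [rad] over the horizon.
    Returns the list of (x, y, yaw) states, including the initial one.
    """
    path = [(x, y, yaw)]
    for _ in range(steps):
        # Position advances along the current heading.
        x += speed * math.cos(yaw) * dt
        y += speed * math.sin(yaw) * dt
        # Yaw rate of the kinematic model: v / L * tan(delta).
        yaw += speed / wheelbase * math.tan(steer) * dt
        path.append((x, y, yaw))
    return path


def position_error(predicted, measured):
    """Euclidean distance between a predicted and a measured (x, y) position."""
    return math.hypot(predicted[0] - measured[0], predicted[1] - measured[1])
```

A short-horizon evaluation in this spirit would start the rollout from a measured state, predict one second or so ahead, and report `position_error` against the logged position at the end of the horizon.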

2606.14215 2026-06-15 cs.LG New submission

LapidaryEngine: Fully Conversational Crystal Generation

Yusei Ito, Yuta Suzuki, Tomoya Murata, Masaki Adachi

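Affiliations: Lattice Lab, Toyota Motor Corporation; The University of Osaka

AI summary: Proposes LapidaryEngine, the first model to support fully conversational crystal generation. A pivot representation enables bidirectional translation between text and crystal structures, and the model accepts free-form natural-language requests and supports iterative refinement.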

Comments 11 main pages, 5 main figures, and 1 table

English abstract

The emergence of Large Language Models (LLMs) has inspired the vision of generating bespoke crystal materials directly from natural-language instructions, enabling users to design materials through intuitive, conversational interaction. Existing text-to-crystal generative models represent important early steps toward this goal, but they suffer from two critical limitations: (i) restricted input formats that require highly structured descriptions (e.g., chemical formulas), and (ii) one-directional generation, where models can map text to crystal but cannot perform the inverse. These limitations prevent fully conversational workflows and hinder alignment with users' inherently ambiguous and evolving desiderata. We address these challenges with LapidaryEngine, the first model to support fully conversational crystal generation. LapidaryEngine accepts free-form natural-language requests and performs iterative refinement and editing in a dialogue-like manner. The key innovation is a pivot representation, a third, intermediate form that enables bidirectional translation between text and crystal structures despite the absence of direct paired datasets. Leveraging this pivot allows robust interpretation of user feedback and precise structural control. We demonstrate LapidaryEngine across diverse tasks, including insulator discovery, stability optimization, compositional modification, and structural editing, showcasing its ability to align generated materials with user intent in an interactive manner.

2606.14211 2026-06-15 cs.AI cs.LG New submission

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

Yinglun Zhu

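Affiliations: University of California, Riverside

AI summary: To address LLM agents' inaccurate self-assessment after observing environment feedback, proposes RefGRPO, which computes a free calibration bonus by contrasting the agent's reflection with the actual outcome and dynamically schedules the bonus coefficient, improving both reflection calibration and task accuracy.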

English abstract

LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback -- even for questions they correctly answered -- and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate $44.4\% \to 7.7\%$) and task accuracy (e.g., $75.1\% \to 76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.
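The two ingredients named in the abstract can be illustrated with a small sketch. The abstract does not give RefGRPO's formulas, so everything here is an assumption made for illustration: a binary agreement bonus, a linearly decaying coefficient, and an underconfidence rate defined as the share of correct rollouts that the agent's reflection flagged as incorrect. The function names and default values are hypothetical.

```python
def calibration_bonus(reflected_correct: bool, actually_correct: bool) -> float:
    """1.0 when the agent's self-assessment agrees with the outcome, else 0.0.

    No reward model or judge is needed: both inputs already exist in the rollout.
    """
    return 1.0 if reflected_correct == actually_correct else 0.0


def bonus_coefficient(step: int, total_steps: int, start: float = 0.5, end: float = 0.1) -> float:
    """Hypothetical schedule: linear interpolation from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def shaped_reward(actually_correct: bool, reflected_correct: bool, step: int, total_steps: int) -> float:
    """Task reward plus the scheduled calibration bonus."""
    task_reward = 1.0 if actually_correct else 0.0
    bonus = calibration_bonus(reflected_correct, actually_correct)
    return task_reward + bonus_coefficient(step, total_steps) * bonus


def underconfidence_rate(rollouts) -> float:
    """Share of correct rollouts whose reflection said 'incorrect'.

    `rollouts` is an iterable of (actually_correct, reflected_correct) pairs.
    """
    correct = [pair for pair in rollouts if pair[0]]
    if not correct:
        return 0.0
    return sum(1 for _, reflected in correct if not reflected) / len(correct)
```

Under these assumptions, a correct answer that the agent also judges correct earns more than a correct answer it doubts, which is the pressure that would push the underconfidence rate down during training.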

2606.14209 2026-06-15 cs.CL New submission

Detecting undisclosed LLM-generated content in parliamentary texts

Minerva Suvanto, Andrea McGlinchey, Peter J. Barclay, Mattias Wahde

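Affiliations: Chalmers University of Technology; University of Cambridge

AI summary: Trains an interpretable text classifier to detect undisclosed LLM use in texts from the UK and Swedish parliaments, and finds that undisclosed LLM use has increased steadily in both parliaments since 2022.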

English abstract

In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

2606.14200 2026-06-15 cs.AI cs.LG New submission

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

Yihan Xia, Taotao Wang

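Affiliations: Shenzhen University

AI summary: Studies when skill-conditional trust is appropriate in swarms of heterogeneous LLM agents. A phase-diagram analysis shows that it helps under high agent heterogeneity, sparse per-skill evidence, and correlated skills, but that attackers can exploit borrowed cross-skill evidence. The paper introduces a zero-cost Conditional Information Value Test (CIVT) and quantifies the residual cost of the attack.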

Comments 18 pages, 8 figures, 2 tables

English abstract

Open platforms increasingly route tasks among heterogeneous LLM agents--differing in base model, scaffold, and tool stack--whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k)--the trust to place in agent i for a task requiring skill k, rather than one score per agent--and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase-diagram analysis answers the first two: conditional trust wins only in a specific regime--high agent heterogeneity, sparse per-skill evidence, and correlated skills--and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime--a small but genuine gain, with the per-skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero-cost Conditional Information Value Test (CIVT) rates GREEN--while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance--we quantify the trade-off.
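The laundering channel described above can be made concrete with a toy router. The abstract does not specify the estimator behind R(i | k), so this sketch assumes a Beta-Bernoulli style pooled estimate in which evidence from other skills is down-weighted by a coupling strength `beta`. The names `conditional_trust`, `route`, and `zero_evidence_gate` are hypothetical.

```python
def conditional_trust(evidence, agent, skill, beta, prior=(1.0, 1.0)):
    """Estimate R(agent | skill) with cross-skill borrowing.

    evidence[agent][skill] = (successes, failures). Counts from the target
    skill have weight 1; counts from any other skill have weight `beta`.
    """
    successes, failures = prior
    for other_skill, (succ, fail) in evidence[agent].items():
        weight = 1.0 if other_skill == skill else beta
        successes += weight * succ
        failures += weight * fail
    return successes / (successes + failures)


def route(evidence, skill, beta, zero_evidence_gate=False):
    """Pick the agent with the highest conditional trust for `skill`.

    With the gate on, agents with no direct evidence on `skill` are excluded.
    Returns None when no agent is eligible.
    """
    candidates = []
    for agent, per_skill in evidence.items():
        succ, fail = per_skill.get(skill, (0, 0))
        if zero_evidence_gate and succ + fail == 0:
            continue
        candidates.append(agent)
    if not candidates:
        return None
    return max(candidates, key=lambda a: conditional_trust(evidence, a, skill, beta))
```

With `evidence = {"honest": {"sql": (8, 2)}, "attacker": {"chat": (200, 0)}}` and `beta=0.5`, the ungated router sends `sql` tasks to the attacker, whose cheap `chat` successes are borrowed into the `sql` estimate, while the gated router picks the honest agent. In this toy setting the gate removes the attack entirely; the abstract notes that in the paper's setting it only bounds it.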

2606.14199 2026-06-15 cs.CL cs.AI cs.LG New submission

OdysSim: Building Foundation Models for Human Behavior Simulation

Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu, Yiming Yang, Maarten Sap

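Affiliations: Carnegie Mellon University, Language Technologies Institute

AI summary: Presents OdysSim, which unifies 62 datasets and 23 benchmark tasks under the SOUL taxonomy and combines midtraining, task-specific RL, and expert distillation to build OSim, an 8B-parameter behavioral foundation model. OSim ranks first or tied-first on 8 of 23 tasks, more than any individual frontier model, produces more human-like outputs, and transfers zero-shot to out-of-distribution user simulation.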

Comments 34 pages. Code: https://github.com/sunnweiwei/OdysSim ; Models and data: https://huggingface.co/collections/cmu-lti/odyssim

English abstract

Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on $τ$-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.

2606.14195 2026-06-15 cs.LG math.OC New submission

Structured Noise Adaptation for Sequential Bayesian Filtering with Embedded Latent Transfer Operators

Naichang Ke, Pongpisit Thanasutives, Yoshinobu Kawahara

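Affiliations: The University of Osaka; RIKEN Center for Advanced Intelligence Project (AIP)

AI summary: To address the inability of the noise models in ELTO-based Kalman filters to adapt to non-stationary processes, proposes a structured noise parameterization that couples learning of an optimal time-invariant noise model with dynamic parameter adaptation, improving state estimation in noisy, time-varying environments.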

Comments Accepted by TMLR

English abstract

Kalman filters based on the Embedded Latent Transfer Operators (ELTO) emerge as novel statistical tools for sequential state estimation. However, a critical limitation stems from their use of simplified noise models, which fail to dynamically adapt to non-stationary processes. To address this limitation, we introduce an ELTO-based Bayesian filtering approach with a new structured parameterization for the filter's noise model. This parameterization enables structured noise adaptation, which couples the data-driven learning of an optimal time-invariant noise model with dynamic parameter adaptation that responds to changes in dynamics within non-stationary processes. Empirical results show that our structured noise adaptation improves the filter's dynamic state estimation performance in noisy, time-varying environments.

2606.14194 2026-06-15 cs.CV cs.LG New submission

Hybrid Classical-Quantum (HCQ) Alzheimer's Classification via Supervised $β$-VAE and Quantum Kernels

Tia Tiwari, Vamshi Krishna Kancharla, Neelam Sinha

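Affiliations: Centre for Brain Research, Indian Institute of Science; Vision and AI Lab (VAL), Indian Institute of Science

AI summary: Proposes a two-stage hybrid classical-quantum pipeline: a supervised 3D β-VAE compresses each MRI volume into a 64-dimensional latent code, PLS selects six components that are encoded as six-qubit states, and a quantum-kernel SVM classifies AD, reaching 72.1% accuracy and 0.799 AUC on ADNI-1.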

English abstract

This paper presents a two-stage Hybrid Classical-Quantum (HCQ) pipeline for binary Alzheimer's disease (AD) classification from 3D T1-weighted structural MRI volumes, where the classical and quantum components are designed to complement each other rather than operate independently. A supervised 3D $β$-variational autoencoder (VAE) is trained end-to-end under voxel-wise reconstruction, KL-divergence, and focal classification losses that compress each 3D MRI volume (resized from 152 x 184 x 152 to 96 x 96 x 96) into a 64-dimensional latent code. Partial Least Squares (PLS) regression selects the six components in the latent code that best separate Alzheimer's Disease (AD) from cognitively normal (CN) subjects and rescales them into rotation angles, which are encoded onto a six-qubit register using the ZZ quantum feature map to give us the respective quantum states. The input to a precomputed-kernel Support Vector Machine (SVM) is an N x N Gram matrix (N = 308), created by calculating the overlap between every pair of quantum states. The novelty of this work lies in the fact that the quantum kernel operates directly on disease-aware features that are learned end-to-end by a supervised autoencoder, rather than on pre-extracted inputs. On 308 ADNI-1 subjects, consisting of 137 AD and 171 CN subjects, the baseline achieved 67.2% accuracy and 0.759 AUC, while the stability-enhanced variant reached 72.1% accuracy and 0.799 AUC with cross-fold variance halved. 3D Grad-CAM further helped validate our model's focus on brain regions linked to Alzheimer's. The HCQ pipeline could serve as a general-purpose framework for diagnostic classification across biomedical imaging domains that present similar challenges for classical approaches.
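The Gram-matrix step can be sketched classically for a simplified encoding. Note the substitution: instead of the ZZ feature map used in the paper, this sketch assumes a product-state encoding with one RY rotation per qubit, for which the state overlap has the closed form of a product of cos((a_i - b_i)/2) terms. The min-max rescaling to [0, π] and the function names are also assumptions made for illustration.

```python
import math


def rescale_to_angles(rows, lo=0.0, hi=math.pi):
    """Min-max rescale each feature column to rotation angles in [lo, hi]."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [
            lo if mx == mn else lo + (hi - lo) * (v - mn) / (mx - mn)
            for v, mn, mx in zip(row, mins, maxs)
        ]
        for row in rows
    ]


def fidelity_kernel(a, b):
    """|<psi(a)|psi(b)>|^2 for product states with one RY(angle) per qubit.

    Each qubit is cos(t/2)|0> + sin(t/2)|1>, so the per-qubit overlap is
    cos((a_i - b_i) / 2) and the full overlap is the product over qubits.
    """
    overlap = 1.0
    for ai, bi in zip(a, b):
        overlap *= math.cos((ai - bi) / 2.0)
    return overlap * overlap


def gram_matrix(angles):
    """N x N matrix of pairwise state fidelities."""
    return [[fidelity_kernel(a, b) for b in angles] for a in angles]
```

A matrix built this way is symmetric with a unit diagonal, and it is the kind of input a precomputed-kernel SVM consumes, for example scikit-learn's `SVC(kernel="precomputed")`. An entangling map such as the ZZ feature map has no such product form, so its overlaps would need a circuit simulator or hardware.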

2606.14192 2026-06-15 cs.LG New submission

DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation

Miduo Cui, Haochen Wang, Shangqin Mao, Xun Yang, Qianlong Xie, Xingxing Wang, Xuri Ge, Ying Zhou, Zhiwei Xu

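Affiliations: Machine Learning, ICML

AI summary: Proposes DRIVE, a framework that decouples candidate action generation from decision making and combines distributional action modeling, retrieval augmentation, and value-based evaluation to improve offline auto-bidding.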

Comments Accepted to ICML 2026

English abstract

Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement learning and, more recently, Transformer-based sequence modeling have shown promise for learning bidding policies from logged data, but their unimodal and purely parametric formulations often collapse multiple effective bidding strategies into suboptimal averaged actions and perform unreliably under sparse or long-tail traffic. To mitigate these limitations, we propose DRIVE (Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework that decouples candidate action generation from decision making for offline auto-bidding. DRIVE combines distributional action modeling, retrieval-augmented candidate generation from high-quality historical decisions, and value-based evaluation to select the most promising bid at inference time. Extensive experiments on AuctionNet and additional offline reinforcement learning benchmarks demonstrate that DRIVE consistently improves bidding performance and generalizes well across multiple Transformer-based methods.

2606.14188 2026-06-15 cs.RO cs.AI cs.LG cs.SY eess.SY math.OC New submission

Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation

Wei-Chen Li, Jeffrey Fang, Sasanka Polisetti, Yuexi Song, Glen Chou

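Affiliations: Georgia Institute of Technology

AI summary: Proposes CORD-SLS, a real-time control method that uses a GPU-parallel differentiable simulator with contact smoothing for efficient gradient-based planning, combined with robust model predictive control and conformal-prediction calibration, achieving millisecond-speed planning and improved safety in rope and cloth manipulation.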

English abstract

We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.

2606.14179 2026-06-15 cs.CL New submission

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim

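Affiliations: Center for Advanced AI, Accenture

AI summary: Proposes CacheRL, which combines a hybrid thinking trajectory pipeline, a three-tier fuzzy cache, and a cache-tier-aware reward to reach 92% process accuracy on multi-step tool calling with 100 times less compute, approaching GPT-5's 94%.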

English abstract

We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
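To show how a tiered cache and a tier-aware reward could fit together, here is a minimal sketch. The abstract does not define the three tiers or the reward weights, so all of it is assumed: tier 1 is an exact match on the tool-call string, tier 2 a match after case and whitespace normalization, and tier 3 a fuzzy string match via `difflib`. `ThreeTierCache`, `tier_aware_reward`, and the weights in `ANSWER_WEIGHT_BY_TIER` are hypothetical.

```python
import difflib


class ThreeTierCache:
    """Replay tool results without live execution (illustrative tiers)."""

    def __init__(self, fuzzy_threshold=0.8):
        self.exact = {}
        self.normalized = {}
        self.fuzzy_threshold = fuzzy_threshold

    @staticmethod
    def _normalize(call):
        # Lowercase and collapse runs of whitespace.
        return " ".join(call.lower().split())

    def put(self, call, result):
        self.exact[call] = result
        self.normalized[self._normalize(call)] = result

    def get(self, call):
        """Return (tier, result); tier is 1, 2, 3, or None on a miss."""
        if call in self.exact:
            return 1, self.exact[call]
        key = self._normalize(call)
        if key in self.normalized:
            return 2, self.normalized[key]
        close = difflib.get_close_matches(
            key, list(self.normalized), n=1, cutoff=self.fuzzy_threshold
        )
        if close:
            return 3, self.normalized[close[0]]
        return None, None


# Hypothetical weights: the looser the cache hit, the less the final answer counts.
ANSWER_WEIGHT_BY_TIER = {1: 1.0, 2: 0.8, 3: 0.5, None: 0.0}


def tier_aware_reward(process_score, answer_score, tier):
    """Weighted mean of process and answer scores, with a tier-dependent answer weight."""
    w = ANSWER_WEIGHT_BY_TIER[tier]
    return (process_score + w * answer_score) / (1.0 + w)
```

The design intent in this sketch is that a rollout with a sound tool-calling process is not heavily penalized when a wrong final answer traces back to an approximate cache hit rather than to the model's own decisions.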

2606.14176 2026-06-15 cs.AI New submission

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

Xiaoxian Duan, Zequn Liu, Yingce Xia

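Affiliations: Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy

AI summary: Proposes VeriGeo, a framework that uses executable reasoning traces and a three-stage verification pipeline for controllable geometry problem generation under user constraints, and applies verification-guided reflection to repair invalid generations, improving data reliability.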

Comments 32 pages, 4 figures, 9 tables

English abstract

Geometry problem generation is useful for AI-assisted education and multimodal mathematical reasoning, but reliable synthesis remains difficult because the problem statement, diagram, constraints, and solution should be mutually consistent. Existing methods often trade off controllability and reliability: seed-based rewriting is flexible but weakly verifiable, whereas diagram-first construction improves validity but is less suited to arbitrary user-specified constraints. We introduce VeriGeo, a controllable geometry generation framework grounded in executable reasoning traces. Given user constraints such as target concepts and difficulty, an Author agent generates a problem and diagram, and a Solver agent produces a proof-aligned solution. Both agents use a shared action sequence that connects natural language, diagrams, geometric constraints, and proof steps into a verifiable representation. A three-stage pipeline checks numerical consistency, analytical realizability, and global consistency, using verification-guided reflection to repair recoverable failures and reject unrecoverable ones. Across five LLM backbones, raw generations frequently fail these checks, while VeriGeo repairs a substantial fraction of the invalid attempts. Supervised fine-tuning on 8.7k examples generated by VeriGeo achieves the best reported GeoQA performance among end-to-end multimodal LLM-based solvers, and obtains strong results on PGPS9K and MathVista-GPS, demonstrating the effectiveness of verified synthetic data for improving multimodal geometry reasoning.

2606.14172 2026-06-15 cs.LG cs.CV New submission

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

Sirui Zhang, Xu Wang, Zhengyu Wu, Xunkai Li, Hongchao Qin

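Affiliations: University of Science and Technology of China

AI summary: Proposes CoMAG, which handles graph-centric and modality-centric tasks in one backbone through task-adaptive Reliable Context Learning and Modality-preserving Hop-token Alignment, improving structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.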

English abstract

Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative representations, and modality-centric tasks requiring fine-grained cross-modal correspondence. However, existing MAG methods often rely on fixed graph contexts or uniformly fused representations, causing task-agnostic propagation and over-compressed fusion that hinder diverse task requirements and modality-specific evidence preservation. To address this, we propose CoMAG, a unified MAG backbone that learns task-adaptive reliable contexts and modality-preserving alignment within them. CoMAG first conducts Reliable Context Learning by estimating edge reliability from multimodal semantic consistency, complementing raw topology with semantic neighbors, and selecting context components through a task-aware gate. It then performs Modality-preserving Hop-token Alignment by maintaining modality-specific multi-hop trajectories, matching modality-hop tokens across modalities, and decoupling shared and private representations. Thus, CoMAG produces graph and modality representations from one forward pass while retaining modality-specific cues. We further analyze stable propagation, over-smoothing mitigation, and modality-collapse control. Experiments on nine OpenMAG datasets compare CoMAG with feature-only, graph-only, multimodal, and unified MAG baselines across graph-level prediction, modality matching, and graph-conditioned generation. Results show that CoMAG achieves the best reported performance, demonstrating that task-adaptive reliable contexts and modality-preserving alignment improve structural prediction, cross-modal matching, and graph-conditioned generation while retaining sparse edge-linear complexity.

2606.14169 2026-06-15 cs.LG New submission

Machine Learning for Biomedical Raman Spectroscopy: From Spectral Acquisition to Clinical Translation

Bogdan Oancea, Ana Maria Seciu-Grama, Nicoleta Siminea, Laura Mihaela Stefan, Alice Stoica, Joel Sjoberg, Marian Necula, Ana-Maria Prelipcean, Corneliu Ovidiu Vrancianu, Eduard Milea, Andrei Păun, Ion Petre, Mihaela Păun

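Affiliations: National Institute of Research and Development for Biological Sciences; University of Bucharest; University of Turku

AI summary: Reviews machine learning across the biomedical Raman spectroscopy pipeline, including preprocessing, diagnostic classification, explainability, and barriers to clinical translation, and emphasizes the need for standardization and robust validation.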

Comments 52 pages, 2 figures

English abstract

Raman spectroscopy provides label-free, chemically specific characterization of biological systems and has become an important tool for cancer diagnosis, molecular subtyping, microbiological identification, and intraoperative decision support. Biomedical Raman spectra are, however, high-dimensional, noisy, and affected by fluorescence background, acquisition variability, and biological heterogeneity, making robust computational analysis essential. This review examines the role of machine learning across the biomedical Raman spectroscopy pipeline, from preprocessing and signal correction to unsupervised structure discovery, supervised diagnosis and molecular stratification, representation and transfer learning, explainability, biomarker discovery, and multimodal integration with imaging, pathology, and molecular profiling. Emphasis is placed on the use of machine learning not only for diagnostic classification, but also for biologically interpretable and clinically actionable analysis. We also discuss the main barriers to clinical translation, including limited dataset sizes, inter-instrument variability, inconsistent preprocessing, insufficient external validation, reproducibility concerns, and limited sharing of software, data, and metadata. We argue that progress will require methodological advances together with standardization, robust validation, explainability, and deployment-ready analytical frameworks. By integrating methodological, biomedical, and translational perspectives, this review outlines key directions for developing reliable and clinically deployable Raman-AI systems.

2606.14168 2026-06-15 cs.CV New submission

MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Ruijie Xu, Xinnan Zhu, Jiayu Ying, Daoguo Dong, Yuzhou Ji, Xin Tan

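Affiliations: East China Normal University; Fudan University; Shanghai Jiao Tong University

AI summary: Proposes MUSE, a memory-grounded multi-agent framework that treats controllable 3D scene construction and editing as incremental requirement satisfaction. On AuthorBench it raises All-Goal success on construction cases from 37.9 to 80.7 over the strongest baseline and reaches a 99.9 preservation rate on the editing test split.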

英文摘要

Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

2606.14162 2026-06-15 cs.CV 新提交

VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling

VideoWeave: 通过联合几何-视频建模解锁视频生成中的几何一致性

Xunzhi Xiang, Zixuan Duan, Yabo Chen, Zhengxuan Wei, Guiyu Zhang, Zixiao Gu, Zhe Gao, Haibin Huang, Chi Zhang, Qi Fan, Xuelong Li

发表机构 * Nanjing University(南京大学) Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院(TeleAI)) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

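AI summary: Proposes VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, mitigating the impact of geometry reconstruction errors and improving the geometric consistency of generated video.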

AI中文摘要

大规模视频扩散模型通常无法随时间保持3D结构,导致几何漂移和视角变化下的不合理运动。现有方法通常通过使用显式几何重建(如深度图、点云或重建的3D结构)来定义条件、监督或奖励信号,从而强制几何一致性,这使得生成器对上游几何管道的误差敏感。我们提出VideoWeave,一种潜在空间后训练框架,利用隐式几何模型特征约束生成分布,提供更灵活、非刚性的引导形式,减轻几何模型重建误差的影响。具体来说,VideoWeave将这些特征适配为几何潜在变量,并在共享去噪空间中与视频潜在变量联合建模,使得几何在训练过程中塑造生成分布。为支持这一过程,我们构建了GeoVid-80K,一个包含8万视频的配对外观和几何表示数据集。在文本到视频和图像到视频生成上的实验表明,VideoWeave在保持强视觉质量的同时改善了几何连贯性。VideoWeave项目页面:https://videoweave.github.io/

英文摘要

Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency by using explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, to define conditions, supervision, or reward signals, making the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that uses implicit geometry-model features to constrain the generative distribution, providing a more flexible and non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. VideoWeave project page at https://videoweave.github.io/

2606.14160 2026-06-15 cs.RO 新提交

GAIT: Legged Robot Proprioceptive State Estimation with Attention over Inertial-Leg Tokens

GAIT: 基于惯性-腿部令牌注意力的足式机器人本体状态估计

Young-Rang Seo, Hajun Kim, Sangmin Kim, Dongyun Kang, Hae-Won Park

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

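AI summary: Proposes combining Inertial-Leg (IL) tokenization with an attention mechanism for proprioceptive state estimation in legged robots; adaptively weighting the measurements from different sensors improves estimation performance, outperforming existing methods on unseen gaits and complex terrain.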

AI中文摘要

本文提出了一种方法,将惯性-腿部(IL)令牌化应用于基于注意力的网络,用于足式机器人的本体状态估计。与现有的将所有传感器测量值拼接成单个扁平向量的学习型状态估计器不同,所提出的架构将惯性测量和每条腿的测量表示为单独的令牌,并使用注意力机制学习每个测量的相对重要性。这种设计允许网络根据当前的接触条件重新加权每个测量,反映了前向运动学测量的可靠性取决于相应脚是否接触的事实。然而,与传统的接触辅助估计器不同,所提出的方法无需依赖显式的接触估计器或基于静止接触假设的显式测量更新即可学习这种行为。为了验证所提出的方法,我们在Unitree Go1机器人上进行了实验,包括模拟中未建模的碎石地形和训练中未见过的步态模式。实验结果表明,所提出的方法在未见步态模式下比现有的学习型状态估计器实现了更好的估计性能,并且也优于基于接触辅助的模型方法。

英文摘要

In this paper, we propose a method that applies Inertial-Leg (IL) tokenization to an attention-based network for proprioceptive state estimation in legged robots. Unlike existing learning-based state estimators that concatenate all sensor measurements into a single flat vector, the proposed architecture represents inertial measurements and leg-wise measurements as individual tokens and uses an attention mechanism to learn the relative importance of each measurement. This design allows the network to reweight each measurement according to the current contact condition, reflecting the fact that the reliability of forward kinematic measurements depends on whether the corresponding foot is in contact. Unlike conventional contact-aided estimators, however, the proposed method learns this behavior without relying on an explicit contact estimator or on explicit measurement updates based on a stationary contact assumption. To validate the proposed method, we conducted experiments on a Unitree Go1 robot, including debris terrain not modeled in simulation and gait patterns not seen during training. Experimental results show that the proposed method achieves better estimation performance than existing learning-based state estimators under unseen gait patterns and also improves performance over contact-aided model-based methods.