arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.08200 2026-06-09 cs.AI cs.LG New submission

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

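Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham

Affiliation: KAIST

AI summary: Proposes the Online Agent-as-a-Judge framework, which deploys an in-world evaluator agent that actively generates relevant situations to assess the capabilities of interactive social agents, improving criteria coverage and agreement with human labels.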

Comments ICML 2026 Workshop on Trustworthy AI for Good

Abstract

Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.
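
As a rough illustration of the coverage idea in this abstract, the sketch below counts a criterion as covered when a trajectory contains at least one piece of evidence for it. The function name, the evidence-log format, and this definition of coverage are assumptions made for illustration; the abstract does not say how the paper computes coverage.

```python
def criteria_coverage(criteria, evidence_log):
    """Fraction of criteria with at least one observed evidence item.

    criteria: list of criterion ids (hypothetical format).
    evidence_log: list of (criterion_id, step) pairs taken from a trajectory.
    """
    if not criteria:
        raise ValueError("criteria must be non-empty")
    observed = {cid for cid, _ in evidence_log}
    covered = [c for c in criteria if c in observed]
    return len(covered) / len(criteria)


criteria = ["greeting", "conflict_handling", "follow_up"]
# Passive rollout: no disagreement arises, so conflict handling stays unobserved.
passive = [("greeting", 0), ("follow_up", 7)]
# Elicited rollout: an evaluator agent provokes a disagreement at step 3.
elicited = passive + [("conflict_handling", 3)]

passive_coverage = criteria_coverage(criteria, passive)
elicited_coverage = criteria_coverage(criteria, elicited)
```

In this toy case the passive rollout covers two of three criteria, while the elicited rollout covers all three.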

2606.08197 2026-06-09 cs.CL cs.DC New submission

AlignFed: Alignment-Aware Asynchronous Federated Fine-Tuning for Large Language Models in Heterogeneous Edge Environments

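Yan Wang, Ziyi Gao, Rui Wang

Affiliation: University of Science and Technology Beijing

AI summary: Proposes AlignFed, a framework whose multi-stage semantic alignment mechanism (version-aware update grouping, cross-version semantic alignment, and fairness-aware aggregation) addresses model drift, client drift, and aggregation unfairness in asynchronous federated fine-tuning of large language models in heterogeneous edge environments.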

Abstract

Large Language Models (LLMs) have significantly propelled the advancement of edge intelligence and have been widely deployed across various scenarios, including autonomous driving, industrial inspection, and personalized IoT services. However, the collaborative adaptation of LLMs on edge devices continues to face formidable challenges due to strict data privacy constraints, highly heterogeneous computing and communication resources, and the non-independent and identically distributed (non-IID) nature of local data. Federated Fine-Tuning (FFT) enables the collaborative optimization of distributed models without exposing raw data. Yet, traditional synchronous aggregation suffers from a severe straggler effect, resulting in high system latency and low resource utilization. Existing asynchronous federated learning methods are predominantly designed for small-to-medium-scale models and struggle to address the specific challenges inherent in LLM fine-tuning, namely model drift caused by stale updates, aggravated client drift stemming from data heterogeneity, and aggregation fairness imbalance resulting from the dominance of fast clients. To address these issues, this paper proposes AlignFed, an asynchronous federated fine-tuning framework for LLMs tailored to heterogeneous edge environments. AlignFed employs a lightweight multi-stage semantic alignment mechanism comprising three core modules: version-aware update grouping, cross-version semantic alignment based on a mini-batch calibration set, and fairness-aware aggregation that integrates both update freshness and client participation frequency. This framework effectively mitigates cross-version model drift and client drift while enhancing aggregation fairness, thereby achieving stable and efficient asynchronous federated optimization in scenarios characterized by high heterogeneity and significant update staleness.
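
The fairness-aware aggregation step described in this abstract weights updates by freshness and by client participation frequency. The following is a minimal sketch under assumed functional forms: exponential staleness decay and inverse participation count are illustrative choices, not the weighting defined in the paper, and every name here is hypothetical.

```python
import math


def aggregation_weights(updates, current_version, decay=0.5):
    """Normalized weights from staleness and participation count.

    updates: list of dicts with 'version' (model version the client trained on)
             and 'participations' (how often the client has contributed so far).
    Assumed form: weight proportional to exp(-decay * staleness) / participations.
    """
    raw = []
    for u in updates:
        staleness = current_version - u["version"]
        if staleness < 0 or u["participations"] < 1:
            raise ValueError("invalid update metadata")
        raw.append(math.exp(-decay * staleness) / u["participations"])
    total = sum(raw)
    return [r / total for r in raw]


def aggregate(deltas, weights):
    """Weighted average of equally sized parameter-delta vectors."""
    dim = len(deltas[0])
    return [sum(w * d[i] for w, d in zip(weights, deltas)) for i in range(dim)]


updates = [
    {"version": 10, "participations": 8},  # fast client: fresh but frequent
    {"version": 7, "participations": 1},   # slow client: stale but rare
]
weights = aggregation_weights(updates, current_version=10)
merged = aggregate([[1.0, 0.0], [0.0, 1.0]], weights)
```

With these assumed settings the rarely seen slow client still receives the larger weight, which is the kind of counterbalance to fast-client dominance that the abstract describes.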

2606.08194 2026-06-09 cs.CL cs.AI New submission

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

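Ryner Tan, Wenxuan Zhang

Affiliation: Singapore University of Technology and Design

AI summary: Introduces the GlobeAudio benchmark, comprising 5,637 multilingual multiple-choice questions that evaluate the auditory reasoning and cultural understanding of large audio-language models under naturalistic audio conditions, and finds substantial performance gaps for open-source models and low-resource languages.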

Abstract

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most evaluations lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers and grounded in naturally occurring audio. To do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio.

2606.08191 2026-06-09 cs.LG cs.AI q-bio.QM New submission

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou

Affiliations: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; Institute for Quantitative and Computational Biology, University of California; Greenwich High School; BCPM Data Limited

AI summary: Proposes the FLaG module, which performs cross-domain token aggregation via a real FFT, summarization of spectral components with learnable latent queries, channel-wise gating, and time-domain reconstruction, with gains on AMP prediction, image classification, and text classification tasks.

Abstract

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. Then we probe its behavior on the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, and the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage and the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.
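
The transform, gate, and reconstruct steps in this abstract can be sketched with a naive real DFT in plain Python. This is a simplified illustration, not the FLaG module: the learnable latent queries are replaced by a fixed gate supplied by the caller, the gate shape (one multiplier per frequency bin and channel) is an assumption, and final pooling is left out. All function names are hypothetical.

```python
import cmath


def rfft(x):
    """Naive real DFT: returns bins 0..n//2 of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def irfft(spec, n):
    """Inverse of rfft for a real sequence of length n."""
    # Rebuild the mirrored half from Hermitian symmetry, then invert.
    full = list(spec) + [spec[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return [sum(full[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def gate_tokens(tokens, gate):
    """tokens: T x C real values; gate: (T//2 + 1) x C real multipliers.

    Each channel is transformed along the token axis, reweighted per
    frequency bin, and reconstructed in the token (time) domain.
    """
    n, channels = len(tokens), len(tokens[0])
    columns = []
    for c in range(channels):
        spec = rfft([tok[c] for tok in tokens])
        gated = [s * gate[k][c] for k, s in enumerate(spec)]
        columns.append(irfft(gated, n))
    # Transpose back to T x C.
    return [[columns[c][t] for c in range(channels)] for t in range(n)]


tokens = [[1.0, 2.0], [3.0, 0.0], [5.0, -2.0], [7.0, 4.0]]
identity_gate = [[1.0, 1.0] for _ in range(3)]
dc_only_gate = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

unchanged = gate_tokens(tokens, identity_gate)
smoothed = gate_tokens(tokens, dc_only_gate)
```

An all-ones gate returns the input tokens, while a gate that keeps only the zero-frequency bin collapses every token to the per-channel mean, which is a quick sanity check that the gating acts on frequency content.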

2606.08186 2026-06-09 cs.RO New submission

Propeller-Assisted Robust 3D Hopping Robot with Hierarchical Force Allocation

Chuhan Zhang, Hongbo Zhang, Yanlin Chen, Yunxi Tang, Yun-Hui Liu, Mingyi Liu, Xiangyu Chu

Affiliations: The Chinese University of Hong Kong; Guangdong Technion–Israel Institute of Technology; Technion–Israel Institute of Technology; Multiscale Medical Robotics Centre

AI summary: Presents Pro-OMEGA2, a propeller-assisted 3D monopedal hopping robot that coordinates leg and tri-rotor forces through a hierarchical force allocation framework, achieving robust hopping and disturbance recovery.

Comments 8 pages, 9 figures, 1 table. Accepted to the 2026 IEEE International Conference on Automation Science and Engineering (CASE)

Abstract

Monopedal hopping robots are conceptually simple but highly dynamic and inherently unstable. Achieving robust 3D hopping is still difficult because ground reaction forces are available only during the short stance phase, while the robot is underactuated in flight. A key unresolved issue is how to improve flight-phase control authority. Propeller assistance provides a promising solution, but it requires careful coordination of leg-generated contact forces and propeller thrusts across stance and flight. This paper presents Pro-OMEGA2, a propeller-assisted 3D monopedal hopping robot with an active 3-RSR parallel leg and a trunk-mounted tri-rotor for auxiliary attitude regulation. To address the force coordination challenge, we propose a Hierarchical Force Allocation (HFA) framework based on a single rigid body (SRB) model. The leg generates the main stance contact wrench, while the tri-rotor provides auxiliary attitude regulation, compensating the residual attitude moment in stance and maintaining attitude during flight. Real-robot experiments in indoor and outdoor scenarios demonstrate sustained 3D hopping, including terrain transitions and impulsive push recovery, validating robustness under unmodeled contact and external disturbances.

2606.08184 2026-06-09 cs.CL New submission

TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding

Mahbub E Sobhani, Anika Tasnim Rodela, Chowdhury Mofizur Rahman, Dewan Md. Farid, Swakkhar Shatabda

Affiliations: United International University; BRAC University; Southeast University

AI summary: Proposes TextEconomizer, an encoder-decoder framework combining denoising transformers and entropy coding, which reduces input size by 50% to 80% with about 153x fewer parameters while maintaining near-perfect text quality on BLEU and other metrics.

Comments Published in Neural Networks (Elsevier), Vol. 203, 2026

Journal ref
Neural Networks, Vol. 203, 109111, 2026

Abstract

Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even under noisy text. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.
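
Entropy coding, as mentioned in this abstract, cannot on average go below the Shannon entropy of the symbol distribution. The sketch below computes that bound for a toy symbol stream and the compression ratio it would imply against a fixed-width encoding. It is a generic illustration with hypothetical names and an assumed 8-bit baseline, not the TextEconomizer pipeline, and it does not reproduce any ratio reported in the paper.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ideal_compression_ratio(symbols, raw_bits_per_symbol=8):
    """Ratio of fixed-width size to the entropy bound (assumed 8-bit baseline)."""
    h = entropy_bits_per_symbol(symbols)
    if h == 0:
        return math.inf  # a single repeated symbol carries no information
    return raw_bits_per_symbol / h


# Toy stream with probabilities 1/2, 1/4, 1/8, 1/8.
codes = [0, 0, 0, 0, 1, 1, 2, 3]
bits = entropy_bits_per_symbol(codes)
ratio = ideal_compression_ratio(codes)
```

For this stream the entropy is 1.75 bits per symbol, so an ideal entropy coder would need 1.75 bits on average instead of 8.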

2606.08170 2026-06-09 cs.RO New submission

Learning from Human Driving: A Human-in-the-Loop Online Behavior Cloning Framework for Autonomous Driving

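Yuhong Shi, Jianyi Liu, Lihang Sun, Li Li, Xudong Dong

Affiliation: State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

AI summary: Proposes HiL-OBC, a human-in-the-loop online behavior cloning framework that combines large-model perception with human driving intelligence through policy initialization with human intervention, Bayesian latent behavioral modeling, and online updates, substantially improving driving performance on a CARLA benchmark.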

Abstract

With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant strides. However, existing paradigms still face severe challenges in complex interaction and long-tail scenarios due to distribution shift and causal confusion. These limitations often result in a lack of human-level decision-making flexibility and safety in extreme conditions. To overcome these limitations, this paper proposes a Human-in-the-Loop Online Behavior Cloning framework (HiL-OBC) for autonomous driving, which aims to deeply integrate the cross-modal perceptual capabilities of LFMs with the high-level driving intelligence of human experts. Specifically, HiL-OBC deployment is executed through three critical phases: policy initialization with human intervention, latent behavioral modeling with Bayesian policy adaptation, and online deployment and updates. Furthermore, we design a Multi-modal Online Behavior Cloning (MOBC) model, which optimizes the base driving policy online through a lightweight network architecture, a takeover trigger mechanism, and a multi-variant loss function, thereby enhancing the system's decision-making robustness in complex environments. We evaluated HiL-OBC on the LangAuto-Human CARLA benchmark. Experimental results demonstrate that the driving policies optimized via the human-in-the-loop mechanism achieve substantial performance gains: the driving score (DS) of StructNav, LFG, and LMDrive increased by 47.25%, 31.59%, and 32.12%, respectively. An analysis of various experimental settings and key components further highlights the advantages of human-in-the-loop learning in improving decision-making robustness and overall driving performance.

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG New submission

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

Affiliations: German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC); Technical University of Munich (TUM)

AI summary: Proposes the CLASP architecture, which combines task-parameterized kernelized movement primitives (TP-KMPs) with a pretrained vision-language model (VLM) to perform skill selection, composition, and active learning from natural language commands without fine-tuning, reaching 73.3%-100% success rates on a 7-DoF manipulator.

Comments 23 pages, 11 figures, 4 tables, 1 listing

Abstract

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
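
One common way to realize a covariance-weighted composition like the one named in this abstract is a product of Gaussians, in which each skill's prediction is weighted by its inverse covariance. The sketch below assumes diagonal covariances and uses hypothetical names; the abstract does not state that CLASP uses exactly this form.

```python
def compose(means, variances):
    """Fuse per-skill predictions that have diagonal covariances.

    means, variances: one equal-length vector per skill.
    Returns the fused mean and fused variance for each dimension.
    """
    dim = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dim):
        precisions = [1.0 / v[d] for v in variances]  # inverse variances
        total = sum(precisions)
        fused_var.append(1.0 / total)
        fused_mean.append(
            sum(p * m[d] for p, m in zip(precisions, means)) / total
        )
    return fused_mean, fused_var


# Skill A is confident about x, skill B is confident about y.
mean, var = compose(
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[0.01, 1.0], [1.0, 0.01]],
)
```

The fused result follows skill A in x and skill B in y, so each skill dominates only in the dimensions where its demonstrations were consistent.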

2606.08167 2026-06-09 cs.LG cs.AI New submission

Explaining Data Mixing Scaling Laws

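Rui Dai, Shuran Zheng

Affiliations: Beijing Institute of Technology; IIIS, Tsinghua University

AI summary: Proposes a unified framework that explains model loss behavior under multi-domain data mixing in terms of two key factors, capacity competition and noise reduction, and predicts high-performing mixtures across multiple scales.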

Comments Published at ICML 2026

Abstract

Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: Capacity Competition, where the allocation of finite model capacity couples domain losses globally, and Noise Reduction, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.
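
The Mean Relative Error used in this abstract to compare fitted loss landscapes is usually the average of |predicted - observed| / |observed|. The sketch below assumes that standard definition, which the abstract does not spell out, and the loss values are made-up placeholders rather than results from the paper.

```python
def mean_relative_error(predicted, observed):
    """Average of |predicted - observed| / |observed| over paired values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(
        abs(p - o) / abs(o) for p, o in zip(predicted, observed)
    ) / len(observed)


# Placeholder domain losses for two mixtures (illustrative numbers only).
observed_losses = [2.0, 3.0]
predicted_losses = [2.0, 3.3]
mre = mean_relative_error(predicted_losses, observed_losses)
```

Here the first prediction is exact and the second is off by 10%, giving a Mean Relative Error of about 0.05.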

2606.08164 2026-06-09 cs.CV New submission

How Much MRI Preprocessing Is Enough? A Cost-Utility Study for Brain MRI Foundation Models

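Jiangshuan Pang, Wangyang Tang, Jing Yan, Zhixuan Cheng, Youzhe He, Zhenkun Zhuang, Tao Zhou, Shiping Liu

Affiliations: University of the Chinese Academy of Sciences; BGI Research

AI summary: Compares the effect of preprocessing levels P0-P7 on self-supervised 3D MRI pretraining and finds that stronger preprocessing is not always better: P2 is the lowest-cost feasible level, stronger preprocessing yields limited gains only on specific tasks, and those gains can be largely recovered downstream.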

Abstract

MRI preprocessing defines the input distribution seen by brain MRI foundation models, yet it is usually treated as routine data cleaning rather than a modeling choice. We ask how much preprocessing is worth its computational cost for self-supervised 3D MRI pretraining. Keeping the corpus, 3D ViT backbone, masking protocol, and downstream evaluations fixed, we compare a graded P0-P7 preprocessing spectrum for masked autoencoding (MAE) and joint-embedding predictive learning (JEPA) on 20,000 heterogeneous brain MRI volumes, then transfer the encoders to IDH prediction, MCI classification, brain age regression, and GLI/PED tumor segmentation. The results do not support a simple "more is better" rule. P0/P1 are numerically unstable, making P2 the lowest-cost feasible level; beyond P2, choosing the best feasible preprocessing level improves aggregate utility by only 3.4 percentage points for MAE and 1.8 percentage points for JEPA, with most paired gains statistically unresolved. Stronger preprocessing is beneficial only in selected regimes: IDH improves modestly, AGE and GLI/PED are often near or best at P2, and MCI shows the clearest empirical P7 gain. Cross-level MCI transfer further shows that much of the P7 advantage can be recovered by applying stronger preprocessing downstream, without requiring P7 throughout pretraining. These findings recast MRI preprocessing as a downstream-aware cost-utility decision rather than a default escalation pipeline. Code is available at https://github.com/PangJiangShuan/PreBrain.

2606.08161 2026-06-09 cs.LG cs.AR cs.NA math.NA New submission

AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

Jiechen Huang, Hector R. Rodriguez, Dingcheng Yang, Zuochang Ye, Yibo Lin, Wenjian Yu

Affiliations: Dept. Computer Science & Tech., BNRist, Tsinghua Univ., Beijing, China; School of IC, BNRist, Tsinghua Univ., Beijing, China; School of IC, Peking Univ., Beijing, China

AI summary: Proposes AttentionCap, a customized Transformer that combines a Gram representation, a symmetric-attention output layer, and a normalized Laplacian loss for accurate capacitance matrix prediction in multi-layer, multi-node settings, with 192x faster inference.

Comments Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC '26)

Abstract

As capacitance extraction accuracy of rule-based pattern matching becomes difficult to sustain at advanced nodes, a growing trend emerges to develop deep-learning-based 2D capacitance models. However, existing MLP- and CNN-based methods constrain their input to fixed metal-layer combinations in a specific process node, limiting their usability in practice. Recognizing the inherent similarity between the capacitance matrix and the prevailing attention mechanism, we propose AttentionCap, a customized Transformer for capacitance matrix learning, with a Gram representation framework, a physics-aligned symmetric-attention output layer, and a novel normalized Laplacian loss. We also introduce a process-node embedding to enable multi-node learning. Trained on synthetic data, AttentionCap attains 0.67%/3.99% self/coupling-capacitance error on unseen real designs under a multi-layer and multi-node setting, surpassing the CNN-Cap baseline with 4.6x/5.7x lower self/coupling error and 192x faster inference speed. A pretrained AttentionCap accurately transfers to an unseen node with only 5K samples and 4K finetuning steps. With sufficient accuracy on unseen real designs and strong transferability to new process nodes, AttentionCap offers highly practical value for modern EDA workflows. Code and data are available at https://github.com/THU-numbda/AttentionCap.
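
A capacitance matrix is symmetric, whereas raw attention scores generally are not. One simple way to obtain a symmetric, attention-like pairwise output is to average the score matrix with its transpose, as sketched below. This is an assumed construction for illustration, with hypothetical names and toy inputs; the abstract does not describe the actual symmetric-attention output layer of AttentionCap.

```python
def matmul_t(a, b):
    """Return a @ b^T for row-major nested lists."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def symmetric_scores(q, k):
    """Symmetrized pairwise scores: 0.5 * (Q K^T + K Q^T)."""
    qk = matmul_t(q, k)
    n = len(qk)
    return [[0.5 * (qk[i][j] + qk[j][i]) for j in range(n)] for i in range(n)]


# Toy query and key vectors for three conductors (illustrative values).
q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
k = [[0.5, 1.0], [2.0, 0.0], [1.0, 3.0]]
scores = symmetric_scores(q, k)
```

Entry (i, j) of the result always equals entry (j, i), so the output respects the symmetry that a physical coupling-capacitance matrix must have.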

2606.08156 2026-06-09 cs.CV cs.AI New submission

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

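Kyumin Choi, Ikbeom Jang

Affiliation: Hankuk University of Foreign Studies

AI summary: Proposes RAPID, a framework that adapts token reduction to ViT depth: redundancy-similarity aware pruning in shallow-to-middle layers and importance-similarity aware merging in deeper layers, achieving a better accuracy-compression Pareto frontier on ImageNet-1K.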

Comments 7 pages, 2 figures

Abstract

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.
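
The deeper-layer stage in this abstract protects tokens with high CLS attention and fuses less important tokens into similar neighbors. A minimal sketch of that idea follows, assuming a top-k rule for protection, cosine similarity for matching, and plain averaging for fusion. These rules and all names are illustrative assumptions, not the RAPID metric.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def merge_tokens(tokens, cls_attention, keep):
    """Keep the `keep` tokens with the highest CLS attention and average
    every other token into its most cosine-similar kept token."""
    order = sorted(range(len(tokens)),
                   key=lambda i: cls_attention[i], reverse=True)
    kept = sorted(order[:keep])               # protected token indices
    groups = {i: [tokens[i]] for i in kept}
    for i in order[keep:]:                    # less important tokens
        target = max(kept, key=lambda j: cosine(tokens[i], tokens[j]))
        groups[target].append(tokens[i])
    dim = len(tokens[0])
    return [[sum(t[d] for t in groups[i]) / len(groups[i]) for d in range(dim)]
            for i in kept]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
cls_attention = [0.4, 0.1, 0.35, 0.15]
merged = merge_tokens(tokens, cls_attention, keep=2)
```

Four tokens are reduced to two here: tokens 0 and 2 are protected, and tokens 1 and 3 are each averaged into the protected token they most resemble.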

2606.08155 2026-06-09 cs.LG cs.IR New submission

Have I Solved This Before? Retrieving Similar Segmentation Problems for Evolutionary Learning

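Andreas Margraf, Henning Cui, Jörg Hähner

Affiliation: University of Augsburg

AI summary: Proposes an evolutionary learning approach based on retrieving similar segmentation problems, reusing existing pipelines to avoid training models from scratch and reduce development cost, and analyzes the feasibility of cross-domain transfer.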

Abstract

Reliable integration and solid configuration of monitoring systems constitute fundamental prerequisites for achieving high efficiency and productivity in contemporary manufacturing environments. Design decisions on sensor type and system architecture have to be made at an early stage and under comparably high uncertainty. This work investigates a research direction that deviates from the traditional monitoring-system development process by shifting the attention from algorithm design to a deeper analysis of the inspection problem. In contrast to traditional design cycles, this paper proposes to gradually collect knowledge and store it in an abstract system model. This enables the retrieval of similar solutions for future use cases, preventing the need for expensive model training from scratch and allowing instead for the incremental refinement of existing base configurations. Reuse of previously generated pipelines reduces the risk of late and costly revisions. As there is little knowledge on cross-domain transferability of filter pipelines, this study analyzes the potential of retrieving filter pipelines to transfer them to different but similar segmentation problems. Finally, we statistically analyze the benefits of this 'transfer learning' variant, which is predominantly applied to image segmentation problems. In addition, we discuss how simple models help balance the trade-off between complexity, technical requirements, and reliability in the design process.

2606.08154 2026-06-09 cs.RO New submission

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

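Cheng Qian, Ruomeng Fan, Yifei Ren, Yilong Wang, Edward Johns

Affiliations: The Robot Learning Lab; Imperial College London

AI summary: Proposes SynthICL, a framework that trains in-context imitation learning policies purely on RGB-only synthetic data, avoiding depth sensing and real-world data and using subgoal prediction for more precise control, with a 79% average success rate across 16 real-world manipulation tasks.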

Abstract

In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io

2606.08153 2026-06-09 cs.LG cs.AI 新提交

LogNEO: A GPT-Neo Reinforcement Learning Framework for Accurate Real-Time Log Anomaly Detection

LogNEO:基于GPT-Neo的强化学习框架用于精确实时日志异常检测

David Eje, Tanmay Sharma, Khush Patel, Manuel Mazzara, Leonard Johard

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

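AI summary Proposes LogNEO, which fine-tunes GPT-Neo with PPO under a position-aware reward and reaches F1 scores of 0.927, 0.913, and 0.984 on the HDFS, BGL, and Thunderbird benchmarks, improves recall over LogGPT by up to 6 percentage points, and achieves 45 ms end-to-end latency in a production deployment.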

Comments 8 pages, 5 figures, 6 tables

详情
AI中文摘要

检测大规模系统日志中的异常对于现代计算基础设施的可靠性和安全性至关重要。我们提出LogNEO,一个基于EleutherAI的GPT-Neo(13亿参数)构建的日志异常检测器,并通过一种新颖的部分信用、指数衰减位置感知奖励方案结合交叉熵正则化(使用近端策略优化PPO)进行微调。位置感知奖励显式建模预测难度:早期位置因正确预测获得更高奖励,而后期位置因错误受到更强惩罚。LogNEO在HDFS、BGL和Thunderbird基准上分别达到0.927、0.913和0.984的F1分数,在保持相当精度的同时,召回率比先前最先进的LogGPT提升高达6个百分点。基于Apache Kafka、Redis和TensorRT加速推理的生产微服务部署在每秒15000个事件下实现了45毫秒的端到端延迟。

英文摘要

Detecting anomalies in large-scale system logs is critical for the reliability and security of modern computing infrastructure. We present LogNEO, a log anomaly detector built on EleutherAI's GPT-Neo (1.3B parameters) and fine-tuned with a novel partial-credit, exponentially decaying position-aware reward scheme combined with cross-entropy regularisation via Proximal Policy Optimisation (PPO). The position-aware reward explicitly models prediction difficulty: early positions receive higher rewards for correct predictions, while later positions incur stronger penalties for errors. LogNEO attains F1-scores of 0.927, 0.913, and 0.984 on the HDFS, BGL, and Thunderbird benchmarks, improving recall by up to 6 percentage points over the prior state-of-the-art LogGPT while maintaining comparable precision. A production microservice deployment over Apache Kafka, Redis, and TensorRT-accelerated inference demonstrates 45 ms end-to-end latency at 15,000 events per second.
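The position-aware reward described in the abstract above could be sketched roughly as follows. The abstract does not give the functional form, so the exponential weight, the `decay` and `partial_credit` parameters, and the top-k rule for partial credit are all assumptions made for illustration, not the authors' scheme.

```python
import math


def position_aware_reward(position, seq_len, correct, in_top_k=False,
                          decay=3.0, partial_credit=0.5):
    """Hypothetical reward for the prediction at `position` (0-based).

    Assumed shape: an exponentially decaying weight makes correct
    predictions at early positions worth more, and makes errors at
    late positions cost more.
    """
    frac = position / max(seq_len - 1, 1)      # 0.0 at the start, 1.0 at the end
    early_weight = math.exp(-decay * frac)     # 1.0 at the start, small at the end
    if correct:
        return early_weight                    # early hits earn the most
    penalty = -(2.0 - early_weight)            # from -1.0 (early) toward -2.0 (late)
    if in_top_k:
        penalty *= (1.0 - partial_credit)      # assumed partial credit for near misses
    return penalty
```

Under these assumptions, a correct prediction at the first position earns the full reward of 1.0, while a late error is penalized almost twice as hard as an early one.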

2606.08152 2026-06-09 cs.RO 新提交

Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs

视觉引导的双臂人形机器人拆解报废18650锂离子电池组

Yile Chen, Zhihao Liu, Xi Vincent Wang, Lihui Wang

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院)

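AI summary Proposes a vision-guided dual-arm disassembly pipeline that uses general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector to disassemble a 21-cell 18650 battery pack from an arbitrary initial pose without fixtures, with an 8/10 end-to-end success rate.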

详情
AI中文摘要

来自电动汽车和便携式电子产品的退役锂离子电池组数量不断增长,需要安全、灵活且可选择性到单个电池的自动化拆解。然而,现有的机器人系统大多假设已知电池组姿态、外部夹具或专用工具,使得在姿态不确定性下无夹具的电池级拆解仍未解决。本文提出一种视觉引导的双臂流水线,使用通用平行爪夹持器、RGB-D感知和预训练抓取检测器,从任意初始姿态拆解一个21节18650电池组。姿态不确定性通过一个学习-过滤感知栈和离散的看-移动腕部相机校正来吸收,而双臂之间的任务中支持转移则无需任何外部夹具即可扩展有效工作空间。该流水线实现了8/10的端到端成功率,电池定位均方根误差为2.4毫米,每个电池组的平均循环时间为6.0分钟,为工业电池回收提供了一个实用的、无夹具的基础模块。

英文摘要

The growing volume of retired lithium-ion battery packs from electric vehicles and portable electronics calls for automated disassembly that is safe, flexible, and selective down to the individual cell. Existing robotic systems, however, mostly assume known pack poses, external fixtures, or specialised tooling, leaving fixture-free cell-level disassembly under pose uncertainty largely unsolved. This paper presents a vision-guided dual-arm pipeline that disassembles a 21-cell 18650 pack from an arbitrary initial pose using only general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector. Pose uncertainty is absorbed by a learn-and-filter perception stack with discrete look-and-move wrist-camera corrections, while a mid-task support transfer between the two arms extends the effective workspace without any external clamp. The pipeline achieves an 8/10 end-to-end success rate, a cell-localisation root-mean-square error of 2.4 mm, and a mean cycle time of 6.0 minutes per pack, providing a practical, fixture-free building block for industrial battery recycling.

2606.08150 2026-06-09 cs.CV 新提交

Property-Informed Diffusion-Based Text-to-Microstructure Generation

基于属性信息的扩散模型文本到微结构生成

Bingxuan Dai, Hongsong Wang, Jie Gui

发表机构 * School of Cyber Science and Engineering, Southeast University(东南大学网络空间安全学院) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室(东南大学)) Purple Mountain Laboratories(紫金山实验室) Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education(教育部区块链应用监管工程研究中心(东南大学))

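AI summary Proposes a property-informed diffusion network that generates 3D microstructures directly from text descriptions, using contrastive text-structure alignment and test-time reward-guided alignment to keep the generated structures semantically consistent and physically feasible.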

Comments Published in CVPR2026, Code is at: https://github.com/hongsong-wang/PropDiff-TMG

详情
AI中文摘要

设计满足预期功能的3D超材料微结构仍然是一个重大挑战,因为它通常需要领域专业知识、迭代模拟和大量手动调整。现有的基于期望目标属性自动生成微结构的逆向设计工作往往受限于设计多样性不足,并在确保生成结构的物理可行性方面面临挑战。为解决这一问题,提出了一种属性信息驱动的扩散网络,能够直接从文本描述生成3D微结构。与传统的属性条件方法不同,我们的方法利用文本输入中丰富的语义和物理属性指导,支持多样化的结构合成。为了强制生成结构与目标文本提示之间的一致性,采用了双重对齐策略,包括对比文本-结构对齐和测试时奖励引导对齐。实验结果表明,该模型能够在广泛材料类别中生成语义有意义且物理上合理的结构。我们的方法在交互式微结构设计方面具有良好潜力,并为结合语言接口与逆向材料发现开辟了新方向。代码可在 https://github.com/hongsong-wang/PropDiff-TMG 获取。

英文摘要

Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. Our approach has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG

2606.08146 2026-06-09 cs.AI 新提交

SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

SAGE: 一种LLM驱动的自我反思智能体框架用于欺诈检测

Yichen Chen, Siying Li, Yuhang Liang, Lijun Wang, Renyang Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Chinese Academy of Sciences(中国科学院大学) China Mobile Communications Group(中国移动通信集团有限公司)

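AI summary Proposes SAGE, described as the first end-to-end LLM-driven multi-agent framework for fraud detection, which uses a Data Diagnostic Tree and natural-language-gradient optimization and improves F1 by an average of 40.86% over baselines across five datasets.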

详情
AI中文摘要

支付、电子商务和电信系统中的欺诈检测需要在个体层面准确、在严重类别不平衡下鲁棒,并且易于风险管理者理解。现有方法至少缺乏这些要求之一:自动化机器学习系统在固定数值空间中搜索,缺乏对数据集的语义感知;基于图神经网络的方法需要预定义的关系图,在个体决策层面仍然不透明;通用大语言模型(LLM)智能体的设计未考虑现实欺诈检测中的召回率和精确率约束。在本文中,我们提出SAGE,首个端到端LLM驱动的多智能体欺诈检测框架。SAGE协调三个专用智能体,基于六层数据诊断树(DDT)和由自然语言梯度引导的马尔可夫决策过程做出决策,在欺诈特定奖励下自动优化模型。在五个欺诈数据集和五个LLM骨干网络上,SAGE在96.00%的方法-数据集比较中获胜,平均F1比基线提升40.86%。代码可在https://github.com/yichenC1c/SAGE获取。

英文摘要

Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and general-purpose large language model (LLM) agents are not designed around the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins 96.00% of method-dataset comparisons and improves F1 by an average of 40.86% over baselines. The code is available at https://github.com/yichenC1c/SAGE.

2606.08144 2026-06-09 cs.CV 新提交

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

IMAGINE:自适应模式-意象增强组合用于组合视频检索

Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang, Yupeng Hu

发表机构 * Shandong University(山东大学) Qilu University of Technology (Shandong Academy of Sciences)(齐鲁工业大学(山东省科学院))

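AI summary To address the mismatch between the implicit semantics of modification texts and explicit visual content in composed video retrieval, proposes the adaptive schema-imagery enhanced compositional network (IMAGINE), which captures implicit concepts with dynamic multimodal prototypes and modulates visual features, reaching state-of-the-art performance on three benchmarks.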

Comments Accepted by ICMR 2026

详情
AI中文摘要

组合视频检索(CVR)旨在检索与参考视频经修改文本修改后匹配的目标视频。现有方法探索跨模态对应关系时,常假设修改对象直接出现在视频中。然而,修改文本常描述未明确呈现但通过语义相关视觉线索隐含表达的概念(例如,“蛋糕”暗示“生日派对”)。当前方法通常依赖在具体空间中对齐显式特征表示,忽略了关键的潜在关联。为解决此问题,我们提出自适应模式-意象增强组合网络(IMAGINE)。与标准显式匹配不同,IMAGINE通过动态多模态原型具体化隐含语义(称为模式意象)。这些原型捕获共享的潜在概念,自适应地调制视觉特征,有效将隐含引导注入检索过程。通过弥合显式视觉内容与隐含检索意图之间的差距,IMAGINE在三个广泛使用的基准上,在CVR和组合图像检索(CIR)中均达到最先进性能。

英文摘要

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., "cake" implying "birthday party"). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.

2606.08140 2026-06-09 cs.LG 新提交

TRUST-SCF: Transformer-based Risk Understanding and Scoring for Transactional Supply Chain Finance

TRUST-SCF:基于Transformer的交易供应链金融风险理解与评分

Mohammadamin Davoodabadi, Amirabbas Shakeri

发表机构 * Department of Growth Barook Co.(Growth Barook公司)

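AI summary Proposes the TRUST-SCF framework, which models transaction sequences with a transformer and combines a financially aligned attention bias, continuous delay prediction, and a label-efficient scoring pipeline for dynamic credit scoring; experiments show improvements over baselines.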

Comments 15 pages, 13 Figures, 3 Tables

详情
AI中文摘要

供应链金融(SCF)和LendTech平台需要能够响应不断变化的交易行为、还款延迟和活跃风险的信用评分系统。我们提出TRUST-SCF,一个基于Transformer的交易级风险预测和动态信用评分框架。每个用户历史被表示为包含利用率、还款延迟和交易位置的交易令牌序列。主要贡献包括:(1) 一种结合利用率相似性和近因性的金融对齐注意力偏置,使模型能够在可比风险暴露条件下比较还款行为;(2) 在对数变换目标空间中进行连续还款延迟预测,减少极端延迟的影响,同时提高对短延迟行为的敏感性;(3) 一个标签高效的信用评分管道,其中最终信用评分不依赖任何显式的外部信用评分标签进行训练,而是从预测延迟、模拟利用率下的潜在风险、实际未付风险暴露和非线性校准中推导得出。在超过30万笔交易的真实交易数据上的实验表明,TRUST-SCF在延迟预测上优于序列基线,并产生与未来还款行为强相关的评分。这些结果表明,TRUST-SCF是SCF和LendTech环境中自适应信用评分和交易级风险缓解的实用框架。

英文摘要

Supply Chain Finance (SCF) and LendTech platforms need credit scoring systems that respond to evolving transaction behavior, repayment delays, and active exposure. We propose TRUST-SCF, a transformer-based framework for transaction-level risk prediction and dynamic credit scoring. Each user history is represented as a sequence of transaction tokens containing utilization, repayment delay, and transaction position. The main contributions are: (1) a financially aligned attention bias that combines utilization similarity and recency, enabling the model to compare repayment behavior under comparable exposure conditions; (2) continuous repayment-delay prediction in a log-transformed target space, reducing the influence of extreme delays while improving sensitivity to short-delay behavior; and (3) a label-efficient credit-scoring pipeline in which the final credit score is not trained using any explicit external credit-score label, but is instead derived from predicted delay, potential risk over simulated utilization, actual unpaid exposure, and nonlinear calibration. Experiments on real transaction data from more than 300,000 transactions show that TRUST-SCF improves delay prediction over sequential baselines and produces scores that are strongly associated with future repayment behavior. These results suggest that TRUST-SCF is a practical framework for adaptive credit scoring and transaction-level risk mitigation in SCF and LendTech environments.
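Contributions (1) and (2) in the abstract above can be illustrated with a small sketch. The abstract does not specify the exact transform or bias, so `log1p` as the target transform, the additive bias form, and the `alpha` and `beta` weights are assumptions chosen for illustration.

```python
import math


def encode_delay(delay_days):
    """Map a repayment delay to the log-transformed target space (assumed log1p)."""
    return math.log1p(delay_days)


def decode_delay(prediction):
    """Invert encode_delay to recover a delay in days."""
    return math.expm1(prediction)


def attention_bias(utilization, alpha=1.0, beta=0.1):
    """Hypothetical additive attention bias over one user's transaction sequence.

    bias[i][j] is added to the attention logit of query i attending to key j.
    Similar utilization and nearby positions (used here as a stand-in for
    recency) are penalized less, so they receive more attention.
    """
    n = len(utilization)
    return [[-alpha * abs(utilization[i] - utilization[j]) - beta * abs(i - j)
             for j in range(n)]
            for i in range(n)]
```

In this sketch the log transform compresses extreme delays (a 1000-day delay maps to about 6.9) while keeping short delays well separated, which matches the motivation stated in the abstract.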

2606.08136 2026-06-09 cs.RO 新提交

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

基于深度Koopman算子的学习预测控制在自动驾驶车辆运动规划中的应用

Xinglong Zhang, Yongqian Xiao, Haotian Cao, Xing Zhou, Xin Yin, Xin Xu

发表机构 * National Natural Science Foundation of China(国家自然科学基金委员会) Science and Technology Innovation Program of Hunan Province(湖南省科技创新计划)

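AI summary Proposes a learning predictive control framework with deep Koopman operators that lifts nonlinear dynamics into a linear observable space and uses receding-horizon actor-critic learning to produce closed-loop state-feedback policies, enabling efficient and safe real-time motion planning under nonconvex constraints.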

详情
AI中文摘要

模型预测控制(MPC)广泛应用于自动驾驶车辆(AV)的运动规划,但其实时应用通常受限于对精确模型的需求以及在动态道路环境中在线求解非线性、非凸优化问题。演员-评论家强化学习为在线策略生成提供了一种有前景的替代方案,但其策略学习过程往往缺乏显式的控制理论结构。本文提出了一种基于深度Koopman算子的学习预测控制(LPC)框架,用于在非凸约束下实现高效的实时运动规划。为了处理非线性和不确定的车辆动力学,使用基于深度Koopman的预测器以数据驱动的方式将系统提升到可解释的线性可观测空间。与计算开环控制序列的传统MPC不同,所提出的LPC框架通过滚动时域演员-评论家学习在每个预测区间内生成闭环状态反馈策略。为了确保在非凸环境约束下的安全性,LPC构建了障碍物的凸局部替代表示并定义了相应的势场函数。这些函数及其梯度直接嵌入到演员-评论家结构中,从而实现高效且具有安全意识的策略学习。在红旗EHS3平台上进行的大量仿真和实际实验表明,与CBF-MPC和LMPCC等基准方法相比,该方法在多种避障场景中在安全性、计算效率和驾驶舒适性方面均表现出优越性能。

英文摘要

Model Predictive Control (MPC) is widely used for autonomous-vehicle (AV) motion planning, but its real-time applicability is often limited by the need for accurate models and online solution of nonlinear, nonconvex optimization problems in dynamic road environments. Actor-critic reinforcement learning offers a promising alternative for online policy generation, yet its policy-learning process often lacks explicit control-theoretic structure. This article proposes a learning predictive control (LPC) framework with deep Koopman operators for efficient real-time motion planning under nonconvex constraints. To address nonlinear and uncertain vehicle dynamics, a deep-Koopman-based predictor is used to lift the system into an interpretable linear observable space in a data-driven manner. Unlike traditional MPC, which computes open-loop control sequences, the proposed LPC framework yields a closed-loop state-feedback policy within each prediction interval through receding-horizon actor-critic learning. To ensure safety under nonconvex environmental constraints, LPC constructs convex local surrogate representations of obstacles and defines corresponding potential-field functions. These functions and their gradients are directly embedded into the actor-critic structure, enabling efficient, safety-aware policy learning. Extensive simulations and real-world experiments on the HongQi-EHS3 platform demonstrate favorable performance in diverse obstacle-avoidance scenarios in terms of safety, computational efficiency, and driving comfort, compared with benchmark methods such as CBF-MPC and LMPCC.
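The lifting step described in the abstract above is commonly written in the following generic deep-Koopman form. The symbols (encoder $\phi_\theta$, lifted state $z_k$, matrices $A$, $B$, $C$) are notation assumed here for illustration and are not taken from the paper.

```latex
z_k = \phi_\theta(x_k), \qquad
z_{k+1} \approx A z_k + B u_k, \qquad
\hat{x}_k = C z_k
```

Here $x_k$ is the vehicle state, $u_k$ the control input, and $\phi_\theta$ a learned network. The dynamics are nonlinear in $x_k$ but approximately linear in the lifted coordinates $z_k$, which is what makes prediction over a horizon cheap.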

2606.08133 2026-06-09 cs.CV 新提交

Gravity-guided Contact Dynamics Estimation from 3D Human Motions

重力引导的3D人体运动接触动力学估计

Cuong Le, Urs Waldmann, Bastian Wandt, Mårten Wadenbäck

发表机构 * Linköping University(林雪平大学)

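AI summary Proposes GraCE, a model that uses body mass distribution and gravity to estimate ground contact forces and pressure distributions from 3D motion data, outperforming prior methods.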

Comments 14 pages, under submission

详情
AI中文摘要

作用于人体的地面接触力对于生物力学研究或运动表现分析至关重要。先前的方法依赖测力台或压力垫来收集地面接触动力学,限制了其在严格控制环境下的适用性。一个更具扩展性的解决方案是直接从运动捕捉数据估计动力学。近期方法仅根据身体与地面之间的垂直距离粗略估计地面接触动力学,无法捕捉所有接触点的复杂压力分布。为此,我们提出GraCE——重力引导的接触动力学估计,一种新颖的全身接触动力学模型,利用身体质量分布和重力的真实影响来估计人体运动。我们使用人体的重心,基于其与身体的相对距离来估计地面接触。每个接触点上的作用力通过预测的接触概率与根据质心轨迹计算的总外力的乘积来估计。我们在GroundLink数据集上的地面反作用力估计和MOYO数据集上的详细接触压力预测中优于相关工作。代码将在接收后公开。

英文摘要

Ground contact forces acting on the human body are crucial for biomechanics studies and sport performance analysis. Prior methods rely on force plates or pressure mats to collect ground contact dynamics, limiting their applicability to carefully controlled settings. A more scalable solution is to estimate the dynamics directly from motion capture data. Recent approaches only roughly estimate the ground contact dynamics from the vertical distance between the body and the ground plane, which cannot capture the complex pressure distribution of all contact points. To this end, we propose GraCE -- Gravity-guided Contact Dynamics Estimation, a novel full-body contact dynamics model for human motions using a realistic influence of body mass distribution and gravity. We use the human's center of gravity to estimate the ground contacts based on its relative distance to the human body. The applied force on each contact is estimated via the product of predicted contact probabilities and the total exterior force computed from the center of mass trajectory. We outperform related work on the GroundLink dataset for ground reaction force estimation, and on the MOYO dataset for detailed contact pressure prediction. The code will be published upon acceptance.
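The force computation outlined in the abstract above can be sketched with Newton's second law. This is a minimal illustration, not the GraCE model: the central-difference acceleration, the z-up gravity vector, and the normalization of contact probabilities so that per-contact forces sum to the total are all assumptions.

```python
def total_external_force(com, mass, dt, g=(0.0, 0.0, -9.81)):
    """Total external force at the middle of three consecutive COM samples.

    com: three (x, y, z) center-of-mass positions in metres, spaced dt seconds apart.
    Uses F = m * (a - g), with a from a central finite difference.
    """
    p0, p1, p2 = com
    acc = [(a - 2.0 * b + c) / dt ** 2 for a, b, c in zip(p0, p1, p2)]
    return [mass * (a - gi) for a, gi in zip(acc, g)]


def distribute_force(total_force, contact_probs):
    """Split the total force across contacts in proportion to their probabilities."""
    s = sum(contact_probs)
    if s == 0:
        # No contact predicted: assign zero force everywhere.
        return [[0.0, 0.0, 0.0] for _ in contact_probs]
    return [[p / s * f for f in total_force] for p in contact_probs]
```

For a 70 kg body standing still, the sketch returns a vertical force of about 686.7 N, which is the body weight, and a contact with probability 0.75 receives three times the force of one with probability 0.25.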

2606.08132 2026-06-09 cs.CV cs.LG 新提交

Phase Marginalization for Patch-Grid Instability in Vision Transformers

视觉Transformer中补丁网格不稳定性的相位边缘化

Oğuzhan Ercan

发表机构 * Scientific and Technological Research Council of Türkiye(土耳其科学技术研究委员会)

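AI summary Proposes Phase Marginalization, which evaluates structured patch-grid phases, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system to reduce patch-grid-phase instability in Vision Transformer predictions; it is training-free and improves segmentation, depth, and matching performance.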

Comments 13 pages, 1 figure, 9 tables

详情
AI中文摘要

视觉Transformer在固定的补丁网格上操作,这可能导致密集预测中相位依赖的不稳定性:改变补丁划分会改变像素可用的令牌证据,尤其是在边界附近。我们将补丁网格相位形式化为一个干扰变量,并提出相位边缘化,一种事后边缘化方法,该方法评估结构化的补丁网格相位,逆对齐密集输出,并在原始图像坐标系中聚合它们。中心变体,K=4的均匀相位边缘化,无需训练,并在测量的分割、深度和局部匹配设置上优于规范的K=1基线。在受控的Cityscapes实验中,均匀相位边缘化相比基于通用移位的四次前向测试时增强(TTA)提供了适度的计算匹配优势(在最强测试的通用行上平均交并比提高0.31)。一项扩展研究进一步表明,K=4是一个实用的成本-精度权衡:K=8基本不变,K=16在更高延迟下增加很少精度。这些结果将补丁网格相位定位为可测量的干扰变量,并将相位边缘化定位为密集ViT预测的简单诊断和事后边缘化基线。

英文摘要

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
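The evaluate, inverse-align, and aggregate procedure described in the abstract above might look like the sketch below. It is an illustration under stated assumptions: images are plain 2D lists, shifts are circular, and the K = 4 phases are taken to be half-patch offsets. The abstract does not say how the structured phases or the boundary handling are actually defined.

```python
def roll2d(grid, dy, dx):
    """Circularly shift a 2D list down by dy rows and right by dx columns."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]


def phase_marginalize(image, model, patch=16, k=4):
    """Average a dense predictor over k patch-grid phases.

    model: callable mapping a 2D list to a 2D list of the same shape.
    """
    half = patch // 2
    # Assumed phase set: no shift plus half-patch shifts along each axis.
    phases = [(0, 0), (0, half), (half, 0), (half, half)][:k]
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    for dy, dx in phases:
        out = model(roll2d(image, dy, dx))   # evaluate under a shifted patch grid
        aligned = roll2d(out, -dy, -dx)      # map back to original coordinates
        for y in range(h):
            for x in range(w):
                acc[y][x] += aligned[y][x] / len(phases)
    return acc
```

With `k=1` this reduces to a single forward pass, which corresponds to the canonical baseline the abstract compares against. The cost grows linearly with `k`, which is why the cost-accuracy trade-off matters.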

2606.08129 2026-06-09 cs.AI 新提交

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

推理中的跨LLM一致性:来自共享交互的证据

Siyu Lou, Yao Yan, Yuntian Chen, Quanshi Zhang

发表机构 * School of Computer Science Shanghai Jiao Tong University(上海交通大学计算机科学学院) Ningbo Key Laboratory of Advanced Manufacturing Simulation Eastern Institute of Technology, Ningbo(宁波市先进制造仿真重点实验室,宁波东方理工大学) College of Computer and Information Science Chongqing Normal University(重庆师范大学计算机与信息科学学院) SymtrustAI.com Eastern Institute of Technology, Ningbo(宁波东方理工大学)

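AI summary Finds that different large language models often share interaction patterns when predicting the same target token from the same prompt, that this consistency is stronger among advanced models, and that shared interactions tend to be lower-order with weaker positive-negative cancellation.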

Comments 20 pages, 8 figures

详情
AI中文摘要

大型语言模型(LLM)在架构、训练数据和优化过程上各不相同,但它们仍可能发展出相似的内部推理模式。在本文中,我们使用基于交互的解释来检验这一假设。我们发现,当从相同提示预测相同目标词时,LLM 经常共享交互模式。这种一致性在高级 LLM 中更为明显。共享交互通常比非共享交互阶数更低,且正负抵消更弱。这些结果表明,高级 LLM 可能被隐式优化为共同的推理模式,尽管产生这种跨模型一致性的机制仍有待探索。

英文摘要

Large language models (LLMs) differ in architecture, training data, and optimization procedures, yet they may still develop similar internal inference patterns. In this paper, we examine this hypothesis using interaction-based explanations. We find that LLMs often share interaction patterns when predicting the same target token from the same prompt. This consistency is more pronounced among advanced LLMs. Shared interactions also tend to be lower-order and show weaker positive-negative cancellation than non-shared interactions. These results suggest that advanced LLMs may be implicitly optimized toward common inference patterns, even though the mechanisms that give rise to such cross-model consistency remain open.

2606.08126 2026-06-09 cs.CV 新提交

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

一石三鸟:面向多VLM选择、自适应与集成的自适应最优传输

Qiyu Xu, Zhanxuan Hu, Yu Duan, Yonghang Tai, Huafeng Li, Quanxue Gao, Xiangyong Cao

发表机构 * Xi’an Jiaotong University(西安交通大学) Yunnan Normal University(云南师范大学) Xidian University(西安电子科技大学) Kunming University of Science and Technology(昆明理工大学)

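AI summary Proposes OSTB, a training-free framework that estimates a consensus sample-class structure with self-adaptive optimal transport and uses it to address model selection, target-domain adaptation, and prediction ensembling for multiple VLMs at once.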

详情
AI中文摘要

视觉语言模型(VLM)能够从语义类别描述中进行视觉识别,这使得它们在目标标注稀缺或不可用时具有吸引力。然而,大多数部署流程首先选择一个单一的VLM,然后将该模型适应到未标记的目标集。这种单骨干范式隐藏了一个关键假设:所选VLM已经与目标域兼容。在实际的跨域部署中,可能有多个通用和领域专用的VLM是可行的,但没有实例级目标标签可用于识别可靠的模型。因此,部署需要一个耦合的解决方案来进行模型选择、目标适应和预测集成。我们从系统级多VLM的角度重新审视这个问题。我们的核心观察是,上述三个决策依赖于同一个潜在对象:目标集中可信的样本-类别结构。不同的VLM可能编码不同的迁移偏差并产生冲突的预测,但它们的输出仍然可以为估计该结构提供互补证据。我们提出了一石三鸟(OSTB),一个基于自适应最优传输的无训练框架。给定一组冻结的候选VLM,OSTB在不更新VLM参数的情况下估计一个共识的样本到类别传输计划。然后,学习到的传输结构被重用于所有部署目标:通过排序共识计划引起的组合语义和视觉可靠性来进行模型选择;通过拟合传输条件视觉分类器获得目标适应;通过可靠性感知的概率集成实现集成。在自然图像、遥感和医学病理基准上的大量实验表明,OSTB在异构候选池下提高了模型排名、适应稳定性和集成鲁棒性。

英文摘要

Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds, a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.
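The idea of a consensus sample-to-class transport plan, as described in the abstract above, can be illustrated with plain entropic optimal transport. This sketch swaps in standard Sinkhorn iterations with uniform marginals in place of the paper's self-adaptive scheme, which the abstract does not specify. The averaging in `consensus_scores` and the `reliability` score are likewise hypothetical.

```python
import math


def consensus_scores(per_model_probs):
    """Average class probabilities over models: result[i][j] for sample i, class j."""
    m = len(per_model_probs)
    n, c = len(per_model_probs[0]), len(per_model_probs[0][0])
    return [[sum(p[i][j] for p in per_model_probs) / m for j in range(c)]
            for i in range(n)]


def sinkhorn_plan(scores, eps=0.5, iters=200):
    """Entropic OT plan with uniform sample and class marginals (assumed)."""
    n, c = len(scores), len(scores[0])
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * c
    for _ in range(iters):
        # Alternate row and column scaling to match the two marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(c)) for i in range(n)]
        v = [(1.0 / c) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(c)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(c)] for i in range(n)]


def reliability(plan, probs):
    """Hypothetical model score: agreement between a model's outputs and the plan."""
    return sum(plan[i][j] * probs[i][j]
               for i in range(len(plan)) for j in range(len(plan[0])))
```

Ranking candidate models by `reliability` against the shared plan mirrors, in a simplified way, the abstract's point that one estimated structure can serve selection, adaptation, and ensembling.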

2606.08123 2026-06-09 cs.CV cs.AI 新提交

Human-Centered Benchmarking of Driver Monitoring Models

以人为中心的驾驶员监控模型基准测试

Ruben Dario Florez-Zela

发表机构 * Universidad Nacional de San Agustin de Arequipa (UNSA)(圣奥古斯丁国立大学(UNSA))

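AI summary Noting that driver monitoring models are usually evaluated on classification accuracy alone, proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates accuracy, explainability, efficiency, and robustness; each model leads in one dimension on the Pareto frontier, but aggregate rankings can mask critical weaknesses.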

Comments 9 pages, 3 figures, 7 tables. Code available at: https://github.com/rubendflorezzela/hcbf-driver-monitoring

详情
AI中文摘要

基于视觉的驾驶员监控系统越来越多地部署在安全关键的智能交通环境中,但它们几乎总是仅根据分类精度进行比较。本文认为精度不足以表征模型在实际部署中的适用性,并提出了以人为中心的基准测试框架(HCBF),该框架从四个维度评估模型:精度、可解释性、效率和鲁棒性。该框架应用于四种代表性的轻量级架构:MobileNetV3、ShuffleNetV2、EfficientNet-B0和DeiT-Tiny,在MRL眼睛数据集上进行眼睛状态分类。虽然这些模型在干净数据集上的精度几乎无法区分,但每个模型恰好在一个维度上领先,并且所有四个模型都位于帕累托前沿。在三种面向部署的权重场景下计算的人为中心得分将ShuffleNetV2排在首位。然而,这个聚合胜出者在传感器噪声下保留了不到一半的性能,并且将闭眼分类为睁眼而失败,而Transformer则保持鲁棒。这些发现表明,聚合排名可能掩盖在操作上具有决定性的维度特定漏洞,强调了多维、以人为中心评估的价值。

英文摘要

Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model's fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.
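The two aggregation ideas in the abstract above, a Pareto frontier over four dimensions and a weighted aggregate score, can be sketched as follows. The normalized weighted mean used for the score is an assumption, since the abstract does not define how the Human-Centered Score is computed, and any scores passed in are the caller's own rather than the paper's results.

```python
def pareto_front(models):
    """Return names of models not dominated by any other model.

    models: dict mapping name -> tuple of scores, higher is better in every dimension.
    """
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return {name for name, s in models.items()
            if not any(dominates(t, s) for other, t in models.items() if other != name)}


def human_centered_score(scores, weights):
    """Hypothetical aggregate: weighted mean of per-dimension scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

A model that leads in even one dimension cannot be dominated, so all such models sit on the frontier. A single weighted score can still rank one of them first while hiding a weak dimension, which is the failure mode the abstract warns about.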

2606.08122 2026-06-09 cs.AI 新提交

Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction

三思而后行:基于意图引导推理的LLM位置预测

Qingxiang Liu, Anqi Liang, Zhuoyang Jiang, Yutian Jiang, Sisuo Lyu, Yu Ji, Haomin Wen, Yuxuan Liang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Shanghai Jiao Tong University(上海交通大学) The Hong Kong University of Science and Technology(香港科技大学) Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

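AI summary Proposes IntentPOI, a two-stage intention-guided reasoning framework that first infers the user's travel intention and then selects POIs based on it, recasting location prediction from direct trajectory matching to intention-guided reasoning and outperforming 11 baselines on three real-world datasets.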

详情
AI中文摘要

根据用户的历史签到记录预测其下一个兴趣点(POI)是基于位置服务中的一项基本任务。尽管最近结合大语言模型的方法展现了强大的推理能力和有前景的结果,但它们通常将预测任务建模为一步式的轨迹到位置映射问题,使得预测容易受到浅层轨迹相关性和历史频率偏差的影响。我们认为用户很少直接选择位置,相反,他们通常首先形成出行意图,然后据此选择特定的POI。受此洞察启发,我们提出了IntentPOI,一个两阶段的意图引导推理框架。在思考阶段,我们通过结合历史移动模式、相似同伴行为和时间上下文来推断用户的中间意图。在执行阶段,我们首先构建一个紧凑的候选池,然后执行意图引导推理,以识别与推断意图最一致的位置。通过明确地将意图推断与位置预测解耦,IntentPOI将下一个POI预测从直接的轨迹匹配转变为意图引导推理。在三个真实世界数据集上的大量实验表明,IntentPOI始终优于十一个最先进的基线方法。

英文摘要

Predicting a user's next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based services. While recent methods incorporating large language models have shown strong reasoning capabilities and promising results, they typically formulate the prediction task as a one-step trajectory-to-location mapping problem, making predictions prone to shallow trajectory correlations and historical frequency bias. We argue that users rarely choose locations directly; instead, they usually first form a travel intention and then select specific POIs accordingly. Motivated by this insight, we propose IntentPOI, a two-stage intention-guided reasoning framework. In the thinking stage, we infer users' intermediate intentions by incorporating historical mobility patterns, similar peer behaviors, and temporal context. In the acting stage, we first construct a compact candidate pool, and then perform intention-guided reasoning to identify locations that best align with the inferred intention. By explicitly decoupling intention inference from location prediction, IntentPOI transforms next-POI prediction from direct trajectory matching into intention-guided reasoning. Extensive experiments on three real-world datasets demonstrate that IntentPOI consistently outperforms eleven state-of-the-art baselines.

2606.08121 2026-06-09 cs.CV 新提交

Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation

可信视觉谓词用于退化条件下的鲁棒操作理解

Fatemeh Ziaeetabar

发表机构 * University of Tehran(德黑兰大学)

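AI summary Proposes a predicate-level reliability framework with a structured predicate vocabulary, confidence-aware estimation, and reliability metrics to analyze how degradations such as blur and occlusion affect visual predicates in manipulation understanding; experiments show that contact-sensitive and dynamic predicates are more fragile.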

详情
AI中文摘要

操作理解需要可靠的关联证据,如接触、支撑、包含、运动耦合、抓取、释放和主动手参与。尽管这些视觉谓词广泛用于事件链、图基和神经符号模型,但它们在视觉退化下的可靠性很少被直接分析。本文引入了一个谓词级可靠性框架,用于在模糊、遮挡、光照变化、低分辨率、帧丢失和检测噪声下实现鲁棒的操作理解。该框架定义了结构化谓词词汇表、置信度感知的谓词估计以及用于谓词保持、退化敏感性、时间一致性、置信度加权稳定性和下游影响的可靠性度量。在受控操作视频和公共自我中心或双手数据集(包括VISOR/EPIC-KITCHENS、H2O和ARCTIC)上的实验表明,谓词失败是结构化的而非均匀的。静态空间谓词相对稳健,而接触敏感、动态和派生谓词(如抓取和释放)更脆弱。在严重退化下,检测噪声、遮挡和帧丢失导致最强的可靠性损失。下游分析表明,退化谓词将操作理解准确率从0.89降至0.58,而在中等退化下去除置信度加权将准确率从0.74降至0.64。这些结果表明,谓词可靠性在视觉感知和结构化操作推理之间提供了一个诊断层。

英文摘要

Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motion coupling, grasp, release, and active-hand involvement. Although these visual predicates are widely used in event-chain, graph-based, and neuro-symbolic models, their reliability under visual degradation is rarely analyzed directly. This paper introduces a predicate-level reliability framework for robust manipulation understanding under blur, occlusion, illumination change, low resolution, frame dropping, and detection noise. The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and reliability metrics for predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments on controlled manipulation videos and public egocentric or bimanual datasets, including VISOR/EPIC-KITCHENS, H2O, and ARCTIC, show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64. These results show that predicate reliability provides a diagnostic layer between visual perception and structured manipulation reasoning.

2606.08113 2026-06-09 cs.LG math.FA math.OC 新提交

Conditional Random Ordered Transport Spaces

条件随机有序传输空间

Lei Luo, Jian Yang

Affiliations * Nanjing University of Science and Technology (南京理工大学) PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education (PCA实验室,教育部高维信息智能感知与系统重点实验室) School of Computer Science and Engineering (计算机科学与工程学院)

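AI summary Proposes conditional random ordered transport spaces (CROTS), which introduce an ordered transport geometry and a conditional risk functional to address whether the direction of transport is admissible in distributional learning, and establishes a stability theorem.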

Comments 24 pages, 1 figure, 2 tables

Details
AI Chinese abstract

小的Wasserstein距离不能证明变换是可容许的。在证据约束、语义、因果、物理、单调或风险敏感学习中,不仅要问两个概率定律相距多远,还要问质量是否沿着可用信息允许的方向移动。我们引入了条件随机有序传输空间(CROTS),这是一类\(L^0\)值随机概率测度空间,配备了Wasserstein环境度量、闭随机序、硬和软有序传输差异,以及用于在证据sigma域下评估序违反的条件风险泛函。核心对象是随机测度值动力学的一个序可容许传输几何,区别于锥值度量、有序Kantorovich构造、单独的随机Wasserstein空间以及生成路径的模型特定残差。我们发展了CROTS作为可靠分布学习空间理论的基础。结果包括硬和软有序传输的适定性和对偶性、软到硬变分收敛、随机提升空间的可测性和完备性、约化到经典Wasserstein和有序几何、有序测地线、约束重心和投影、条件风险-传输对偶性以及序违反分布的分离。主要稳定性定理表明,随机学习动力学可以在环境Wasserstein度量中收敛,而其局部可容许性泄漏遵循一个独立的条件序-风险递归。由此产生的渐近序-风险下界为证据过度、有序分布偏移、鲁棒性失败和可容许分布动力学提供了数学语言。

English abstract

A small Wasserstein distance does not certify that a transformation is admissible. In evidence-constrained, semantic, causal, physical, monotone, or risk-sensitive learning, one must ask not only how far apart two probability laws are, but also whether mass has moved in a direction allowed by the available information. We introduce conditional random ordered transport spaces (CROTS), a class of \(L^0\)-valued spaces of random probability measures equipped with a Wasserstein ambient metric, a closed stochastic order, hard and soft ordered transport discrepancies, and a conditional risk functional for evaluating order violation under an evidence sigma-field. The central object is an order-admissible transport geometry for random measure-valued dynamics, distinct from cone-valued metrics, ordered Kantorovich constructions, random Wasserstein spaces alone, and model-specific residuals for generative paths. We develop the foundations of CROTS as a space theory for reliable distributional learning. The results include well-posedness and duality for hard and soft ordered transport, soft-to-hard variational convergence, measurability and completeness of the random lifted space, reductions to classical Wasserstein and ordered geometries, ordered geodesics, constrained barycenters and projections, conditional risk-transport duality, and separation of order-violating distributions. The main stability theorem shows that random learning dynamics may converge in the ambient Wasserstein metric while their local admissibility leakage follows a separate conditional order-risk recursion. The resulting asymptotic order-risk floor provides a mathematical language for evidence overreach, ordered distribution shift, robustness failure, and admissible distributional dynamics.
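
The opening claim, that a small Wasserstein distance does not certify admissibility, can be seen in a one-dimensional toy case. The sketch below is not the CROTS construction: it assumes equal-size empirical samples on the real line, takes "admissible" to mean that mass may only move upward (first-order stochastic dominance), and uses a simple hard-order violation score made up for this example.

```python
import numpy as np

def wasserstein_1d(x, y):
    """W1 between two equal-size empirical samples on the real line.

    In one dimension the optimal coupling matches sorted samples.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x - y)))

def order_violation(x, y):
    """Average amount of mass moved downward under the sorted coupling.

    Illustrative score (an assumption, not the paper's discrepancy): it is
    zero exactly when every sorted y_i >= x_i, i.e. when y dominates x in
    first-order stochastic order for equal-size samples.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.mean(np.maximum(x - y, 0.0)))

source = np.array([0.0, 1.0, 2.0, 3.0])
shifted_up = source + 0.1    # moves mass in the allowed direction
shifted_down = source - 0.1  # moves mass in the forbidden direction

w_up = wasserstein_1d(source, shifted_up)
w_down = wasserstein_1d(source, shifted_down)
v_up = order_violation(source, shifted_up)
v_down = order_violation(source, shifted_down)
print(w_up, w_down, v_up, v_down)
```

Both targets sit at essentially the same small Wasserstein distance from the source, yet only the downward shift has a nonzero order violation, which is the kind of gap between metric closeness and admissibility that the abstract describes.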

2606.08107 2026-06-09 cs.RO cs.AI New submission

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ego-Pi: 面向自我中心人类与机器人数据的VLA微调

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn

Affiliations * Stanford University (斯坦福大学) Meta

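AI summary To address the scarcity of robot data, the paper leverages egocentric human data and fine-tunes the $π_{0.5}$ model, enabling robots to learn new task semantics and compose existing skills without corresponding robot data.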

Details
AI Chinese abstract

机器人技术面临数据稀缺的根本挑战。与语言或视觉研究不同,机器人操作没有互联网规模的数据集。一个有前景的途径是利用自我中心人类数据,这类数据更容易收集、范围更广且规模更大。为此,我们研究了跨人类和配备灵巧五指手的类人机器人实体学习的关键设计选择,以$π_{0.5}$模型为基础。我们的结果表明,人类数据使机器人能够学习新的任务语义,并将现有技能组合成新颖的行为,而无需相应的机器人数据。论文网站:https://egopipaper.github.io/

English abstract

Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, robotic manipulation has no internet-scale dataset. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. To this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $π_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. Paper website: https://egopipaper.github.io/