arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2606.11779 2026-06-11 cs.CV New submission

Battery detection of XRay images using transfer learning

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

Affiliations: Ruhr West University of Applied Sciences

AI summary: This study uses transfer learning with a YOLOv5m model to detect batteries in X-ray images and classify three types of lithium-ion batteries, reaching 94% detection precision with a 22 ms inference time.

Comments Published at the European Symposium on Artificial Neural Networks (ESANN 2022)

Abstract

The need for detecting and sorting batteries is increasing drastically across many applications. This study demonstrates the potential of transfer learning for predicting whether an image contains a battery, locating it, and identifying three types of Lithium-Ion Batteries (LIB): prismatic, pouch, and cylindrical. In particular, it applies transfer learning in two stages: first training on a large-scale dataset to detect electronic devices, starting from a pre-trained YOLOv5m, then using the resulting weights to detect and classify batteries. Battery detection reaches 94% precision, outperforming the pre-trained YOLOv5m weights by 5%, with a 22 ms inference time.

2606.11770 2026-06-11 cs.AI New submission

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky

Affiliations: School of Computing and Information Systems, The University of Melbourne

AI summary: Proposes the SVoT framework, which uses reinforcement learning to generate verifiable intermediate states and visualizations, combining textual and visual reasoning chains to make multimodal large language models more reliable at multi-hop spatial reasoning.

Abstract

Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.
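
As a rough illustration of the group-relative step in GRPO mentioned in the abstract above, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions made for illustration; the abstract does not describe SVoT's actual reward functions or training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of responses sampled for the same prompt.

    Assumed form: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four responses scored by a verifiable reward (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.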

2606.11769 2026-06-11 cs.AI cs.LG New submission

When Do Data-Driven Systems Exhibit the Capability to Infer?

Maximilian Poretschkin, Tabea Naeven

Affiliations: Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS); University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary: To address the vague definition of the capability to infer in the EU AI Act, proposes a grading framework based on statistical learning theory and uses credit scoring case studies to show how to judge whether a system has that capability.

Abstract

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at https://github.com/fraunhofer-iais/inference-framework-creditscorecards.

2606.11762 2026-06-11 cs.CL cs.AI New submission

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

Affiliations: Raffles Institution; College of Computing and Data Science, Nanyang Technological University; Lee Kong Chian School of Medicine, Nanyang Technological University; Centre of AI in Medicine (C-AIM), Nanyang Technological University

AI summary: Proposes a domain-agnostic automated framework that quantifies the divergent and convergent creativity of LLMs on open-ended tasks using semantic entropy and retrieval-based multi-agent evaluation, validated in three domains: problem-solving, research ideation, and creative writing.

Comments Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: https://github.com/tanminsen/creativity-eval

Abstract

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
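
To make the idea of semantic entropy as a diversity measure more concrete, here is a minimal sketch that computes Shannon entropy over semantic clusters of model responses. It assumes the responses have already been grouped into meaning-equivalent clusters by some upstream step; the function name `semantic_entropy` and the cluster labels are hypothetical, and the paper's exact formulation may differ.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids holds one label per generated response; responses with the same
    meaning are assumed to share a label.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical cluster assignments for two sets of four responses.
same_idea = ["a", "a", "a", "a"]    # every response expresses one idea
four_ideas = ["a", "b", "c", "d"]   # every response expresses a different idea
low = semantic_entropy(same_idea)
high = semantic_entropy(four_ideas)
```

Under this reading, a model that keeps restating one idea scores zero, while one that spreads its responses evenly over many distinct ideas scores close to the logarithm of the number of clusters.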

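2606.11761 2026-06-11 cs.LG New submission

RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

Atif Hassan, Swanand Khare, Jiaul H. Paik

Affiliations: IIT Kharagpur

AI summary: Proposes the RCAP algorithm, which estimates the per-class fraction of samples to retain with a closed-form solution, adjusts it adaptively, and prioritizes high-loss samples when sampling. It outperforms existing methods across multiple datasets and training paradigms, and with only 10% of the data improves performance on class-imbalanced datasets by more than 1%.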

Comments Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (UAI 2025)

Journal ref: PMLR, volume 286, pages 1648-1662, 2025

Abstract

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes samples having high loss for populating the class-wise subsets. We evaluate RCAP on six diverse datasets ranging from class-balanced to highly imbalanced using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10\%$ data, RCAP delivers $>1\%$ improvement in performance on class-imbalanced datasets compared to full data training while providing an average $8.69\times$ speedup. The code can be accessed at https://github.com/atif-hassan/RCAP-dynamic-dataset-pruning
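
The per-epoch selection described above (a per-class budget driven by class-wise loss, then loss-prioritized sampling inside each class) can be sketched as follows. This is not the authors' algorithm: the budget split proportional to mean class loss is an assumed stand-in for the paper's closed-form solution, and the names `prune_epoch`, `keep_ratio`, and the example data are hypothetical. The within-class step uses Efraimidis-Spirakis weighted sampling without replacement.

```python
import random
from collections import defaultdict

def prune_epoch(labels, losses, keep_ratio, rng):
    """Return the sorted sample indices kept for one epoch."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Class-wise aggregated loss (mean over the class).
    class_loss = {y: sum(losses[i] for i in idx) / len(idx)
                  for y, idx in by_class.items()}
    total = sum(class_loss.values()) or 1.0
    budget = round(keep_ratio * len(labels))

    kept = []
    for y, idx in by_class.items():
        # ASSUMPTION: budget share proportional to mean class loss,
        # standing in for the paper's closed-form estimate.
        k = max(1, min(len(idx), round(budget * class_loss[y] / total)))
        # Weighted sampling without replacement: larger loss -> larger key.
        keys = [(rng.random() ** (1.0 / max(losses[i], 1e-12)), i) for i in idx]
        kept.extend(i for _, i in sorted(keys, reverse=True)[:k])
    return sorted(kept)

# Hypothetical imbalanced data: 8 easy samples of class 0, 2 hard samples of class 1.
labels = [0] * 8 + [1] * 2
losses = [0.1] * 8 + [0.9] * 2
kept = prune_epoch(labels, losses, keep_ratio=0.5, rng=random.Random(0))
```

In this toy run the small, high-loss class keeps both of its samples while the large, low-loss class is pruned heavily, which is the kind of behaviour a class-aware budget is meant to produce.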

2606.11745 2026-06-11 cs.CV cs.AI New submission

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

Haoping Yu, Yuanxi Li, Jing Ma

Affiliations: University of Science and Technology of China

AI summary: Proposes BridgeVLM, which induces a causal graph from multi-image inputs, converts it into Causal Tokens, and injects them into the LLM decoder for causal message passing, substantially improving multi-image causal reasoning.

Abstract

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% $\rightarrow$ 75.1%).

2606.11744 2026-06-11 cs.CL cs.AI New submission

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

Sidney Tio, Arunesh Sinha, Pradeep Varakantham

Affiliations: School of Computing and Information Systems, Singapore Management University; Department of Management Science and Information Systems, Rutgers Business School

AI summary: To address the poor tutoring performance of LLMs over long dialogues, proposes a system that separates curriculum planning, Socratic dialogue, and knowledge-state inference, using a PPO policy to decide the teaching order; it outperforms baseline models on STEM and non-STEM topics.

Comments 10 Main Body Pages, with Appendices

Abstract

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue: on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
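
The prerequisite knowledge graph described above can be pictured with a small sketch in which subtopics are nodes and each node lists the subtopics it depends on. The curriculum, the helper `teachable`, and the idea of restricting the policy's choice to nodes whose prerequisites are mastered are illustrative assumptions; the abstract does not specify the action space, and the PPO policy itself is not reproduced here.

```python
from graphlib import TopologicalSorter

# Hypothetical curriculum: each subtopic maps to its prerequisite subtopics.
prereqs = {
    "functions": [],
    "limits": ["functions"],
    "derivatives": ["limits"],
    "integrals": ["derivatives"],
}

def teachable(prereqs, mastered):
    """Subtopics not yet mastered whose prerequisites are all mastered."""
    return sorted(
        topic for topic, deps in prereqs.items()
        if topic not in mastered and set(deps) <= mastered
    )

# One teaching order that respects every dependency edge.
order = list(TopologicalSorter(prereqs).static_order())

# Candidate nodes a sequencing policy could choose from at two points in a session.
first_choices = teachable(prereqs, mastered=set())
next_choices = teachable(prereqs, mastered={"functions"})
```

A learned policy would pick among the candidate nodes and decide how many dialogue turns to spend on the chosen one, updating `mastered` from the progress signal returned by the LLM.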

2606.11743 2026-06-11 cs.RO cs.GR cs.LG New submission

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

Siyu Ma, Yuqi Liang, Chang Yu, Yunuo Chen, Hao Su, Yixin Zhu, Yin Yang, Chenfanfu Jiang

Affiliations: University of California, Los Angeles; University of California, San Diego; University of Electronic Science and Technology of China; Peking University; University of Utah

AI summary: Proposes the TacCoRL framework, which injects tactile feedback into vision-language-action policies through sim-real co-training and reinforcement learning, raising the average success rate on contact-rich tasks by 22.5 percentage points (72.5% vs. 50.0%).

Abstract

Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at https://tac-corl.github.io/

2606.11740 2026-06-11 cs.CV cs.CL New submission

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai

Affiliations: IQuest Research

AI summary: Proposes the UniReason-Med framework, which transfers reasoning ability from 2D medical images to 3D medical VQA through a shared grounded reasoning interface, combining supervised fine-tuning and reinforcement learning to substantially improve 3D reasoning.

Abstract

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at https://github.com/IQuestLab/unireason-med.

2606.11739 2026-06-11 cs.CV cs.AI New submission

Multi-View In-Cabin Monitoring System for Public Transport Vehicles

Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann

Affiliations: Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI)

AI summary: Presents a multi-view in-cabin monitoring dataset with synchronized RGB-D images and LiDAR data, annotated with 3D human poses and bounding boxes, supporting the evaluation of multi-view 3D detection models.

Comments Submitted to ICDM2026

Abstract

We introduce a multi-view in-cabin monitoring dataset for public transportation with synchronized RGB and depth images from four inward-facing cameras and a rotating LiDAR covering the vehicle interior of a digitalized and partly automated German city bus. The dataset contains 9,136 synchronized samples with annotations and is accompanied by a calibration and pseudo-labeling pipeline that generates 3D human pose estimates and oriented 3D bounding boxes for occupants. We further provide a nuScenes-format conversion and benchmark representative multi-view 3D detection models (e.g., Lift-Splat-Shoot and BEVFusion), supporting comparative evaluation and small-scale training of multi-view in-cabin perception models. The dataset and tools are available at https://github.com/EvgenyGorelik/multiview_incabin_dataset.

2606.11724 2026-06-11 cs.AI New submission

Mind the Perspective: Let's Reason Recursively for Theory of Mind

Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky

Affiliations: School of Computing and Information Systems, The University of Melbourne, Australia; SensiLab, Monash University, Australia

AI summary: Proposes the RecToM framework, which models nested beliefs through recursive perspective construction and reduces higher-order belief questions to actual-world questions, achieving state-of-the-art performance on multiple ToM benchmarks.

Abstract

Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM's perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.

2606.11722 2026-06-11 cs.LG cs.AI cs.CL New submission

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Sida Liu, Feijiang Han

Affiliations: Independent Researcher; University of Maryland

AI summary: Proposes ICALens, which uses independent component analysis (ICA) to efficiently extract interpretable directions from language-model representations without training sparse autoencoders, and is competitive on SAEBench.

Comments Ongoing Project

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.

2606.11719 2026-06-11 cs.CV cs.AI New submission

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He

Affiliations: Peking University; Ant International; The University of Hong Kong

AI summary: Proposes Ouroboros-Spatial, a self-evolving framework in which a proposer and a solver interact in a closed loop to dynamically generate training samples matched to the model's ability, substantially improving Qwen3-VL on six spatial reasoning benchmarks with roughly a tenth of the training data.

Abstract

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.

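2606.11712 2026-06-11 cs.CL cs.AI cs.LG New submission

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

Youwang Deng

Affiliations: EpistemicaLab (Independent Research)

AI summary: Proposes a diagnostic framework that decomposes LLM user-side memory into three orthogonal axes (behavioral consistency, factual presence, and factual absence), finds that parametric memory and retrieval memory are asymmetric across these axes, and shows that RLHF tuning worsens the asymmetry.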

Comments Preprint. Code: https://github.com/EpistemicaLab/substrate-asymmetry-memory

Abstract

User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

2606.11711 2026-06-11 cs.LG stat.ML New submission

Capacity-Constrained Online Convex Optimization with Delayed Feedback

具有延迟反馈的容量受限在线凸优化

Alexander Ryabchenko, Idan Attias, Daniel M. Roy

Affiliations: Department of Statistical Sciences, University of Toronto; Vector Institute; Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), hosted by UIC and TTIC

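AI summary: Studies delayed online convex optimization under a hard capacity constraint (at most C pending rounds can be tracked at any time). By introducing a semi-clairvoyant model and a Delayed-Weighted FTRL algorithm, it gives the first regret bounds for capacity-constrained OCO under convex and strongly convex losses.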

AI Chinese abstract

具有延迟反馈的在线学习通常假设学习者可以跟踪所有待处理轮次直到其反馈到达。在实践中,跟踪资源是有限的,未跟踪轮次的反馈将永久丢失。在本文中,我们研究了在硬容量约束下的延迟在线凸优化(OCO),其中任何时候最多可以跟踪$C$个待处理轮次。为了建模延迟信息,我们引入了一个半先知模型,该模型细化了先前工作中的先知假设:学习者不需要在预测时知道延迟,而是在线观察延迟到期,这与经典的无约束延迟设置一致。我们的方法通过归约到一个新颖的“延迟且加权”的OCO问题来实现,使用一个随机化跟踪决策并对结果观测进行重要性加权的调度器。对于这个基础问题,我们提出并分析了延迟加权FTRL及其赌博机变体,建立了明确刻画时变权重与延迟反馈之间相互作用的遗憾界。将这些基础学习器与我们的调度器相结合,首次给出了在凸和强凸损失下容量受限OCO的遗憾保证,适用于一阶和赌博机反馈。对于一阶反馈,容量$C = \Omega(\log T)$足以在忽略对数因子的情况下恢复标准延迟OCO的速率。对于赌博机反馈,遗憾率由$(1 + \sigma_{\text{max}}/C)$的幂次调制,其中$\sigma_{\text{max}}$是任何时候的最大待处理观测数。这使得当$C < \sigma_{\text{max}}$时遗憾界能够优雅地退化,同时保持次线性。

English abstract

Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel "delayed and weighted" OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = \Omega(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + \sigma_{\text{max}}/C)$, where $\sigma_{\text{max}}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < \sigma_{\text{max}}$, while remaining sublinear.
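
The idea of a scheduler that "randomizes tracking decisions and importance-weights the resulting observations" can be illustrated with the simplest possible case. The sketch below assumes a fixed tracking probability that does not depend on the current capacity state; that is a simplification for illustration, and the paper's scheduler is presumably more involved. All names are hypothetical.

```python
import random

def importance_weighted_feedback(gradients, track_prob, rng):
    """Track each round with probability `track_prob` and reweight by 1/track_prob.

    Untracked rounds contribute 0 because their feedback is lost. Scaling the
    tracked ones by 1/track_prob makes each returned value an unbiased
    estimate of the corresponding true gradient.
    """
    estimates = []
    for g in gradients:
        tracked = rng.random() < track_prob
        estimates.append(g / track_prob if tracked else 0.0)
    return estimates

rng = random.Random(0)                      # fixed seed for reproducibility
true_gradients = [1.0] * 20000              # toy scalar gradients
estimates = importance_weighted_feedback(true_gradients, track_prob=0.25, rng=rng)
mean_estimate = sum(estimates) / len(estimates)   # close to 1.0 on average
```

The price of unbiasedness is variance: each tracked observation is inflated by a factor of 1/track_prob, which is one intuitive reason a regret bound would depend on how small the capacity is relative to the number of pending rounds.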

2606.11709 2026-06-11 cs.LG cs.CL New submission

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

RLCSD: 基于对比策略自蒸馏的强化学习

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen

Affiliations: Tsinghua University; Tongyi Lab, Alibaba Group; Peking University

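AI summary: To address privilege-induced style drift in on-policy self-distillation, proposes RLCSD, which suppresses the style shift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, improving reasoning models on mathematical and logical reasoning tasks.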

Comments 20 pages, 9 figures, 9 tables

AI Chinese abstract

策略自蒸馏(OPSD)通过将模型自身的分布与在特权上下文(通常是已验证的解决方案)下产生的分布对齐,为推理模型提供密集的令牌级监督。然而,我们表明从这种分布差距中提取的学习信号集中在风格令牌而非任务承载令牌上,因为提示模型倾向于产生更直接、更短的输出。我们将这种病理现象称为\emph{特权诱导的风格漂移},它会破坏训练稳定性或导致响应长度缩短。为了解决这个问题,我们提出\textbf{RLCSD}(基于对比策略自蒸馏的强化学习),通过对比正确提示下的师生差距与错误提示下的师生差距来缓解这种漂移,抑制无论正确与否,条件于提示往往诱发的风格转变,并产生更集中于任务承载令牌的信号。在Qwen3(1.7B/4B/8B)和Olmo-3-7B-Think上的数学和逻辑推理实验表明,RLCSD始终优于GRPO和先前的OPSD方法。我们进一步表明,对比原则是通用的:它可以嵌入现有的OPSD方法中以提高它们,并且其潜在见解可扩展到更广泛的跨模型策略蒸馏设置。

English abstract

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.

2606.11708 2026-06-11 cs.RO New submission

Explore From Sketch: Accelerating UAV Exploration in Large-scale Environments with Prior Maps

从草图探索:利用先验地图加速无人机在大规模环境中的探索

Tiancheng Lai, Yuman Gao, Xiangyu Li, Ruitian Pang, Xingpeng Wang, Siqi Shen, Mengke Zhang, Yin He, Fei Gao, Chao Xu, Yanjun Cao

Affiliations: Institute of Cyber-Systems and Control, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University, and Huzhou Key Laboratory of Autonomous System; Zhejiang Zhongyan Industry Co., Ltd; Differential Robotics Technology Company

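AI summary: Proposes a framework that uses sparse, unaligned, and even discrepant 2D prior maps to accelerate UAV exploration of large-scale environments. Robust 2D-3D point cloud registration and hierarchical viewpoint planning yield up to a 34.2% improvement in exploration efficiency.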

Comments 25 pages, 22 figures

AI Chinese abstract

无人机在大规模、拓扑复杂环境中的自主探索常因次优调度和绕路而效率低下。先验地图(如施工图纸)虽然通常不精确且有缺陷,但在许多场景中易于获取,并具有提供全局结构指导的潜力。本文提出一种新颖的探索框架,利用稀疏、未对齐甚至不一致的2D先验地图进行基于LiDAR的无人机探索。首先,提出一种鲁棒的2D-3D点云配准流程,将LiDAR观测与先验地图对齐。该配准流程结合了用于单帧候选检索的GeoContext描述符、用于带异常值剔除的粗变换估计的多帧验证机制,以及用于精化的Scale-ICP算法。配准模块能够处理地图差异,并在几何歧义出现时提供多个假设。为了有效利用配准结果进行探索规划,我们进一步开发了一种在定位不确定性下的层次化视点规划策略。该层次化策略首先将局部视点空间附着到先验引导点上,并采用蒙特卡洛树搜索求解器确定每个配准假设下的遍历顺序。为减轻配准不确定性,风险感知选择器使用置信度加权的旅行风险评估先验序列,并在选定的先验引导下,通过固定端点旅行商问题生成高效的局部覆盖路径。基准评估显示,与最先进方法相比,探索效率提升高达34.2%,飞行距离减少37.9%,而广泛的仿真和现场实验进一步证明了对先验地图不完整和变形的鲁棒性。

English abstract

Autonomous exploration with UAVs in large-scale, topologically complex environments often suffers from low efficiency due to suboptimal scheduling and detours. Prior maps (e.g., construction drawings), although usually imprecise and flawed, are readily available in many scenarios and have the potential to provide global structural guidance. This paper presents a novel exploration framework that leverages sparse, unaligned, and even discrepant 2D prior maps for LiDAR-based UAV exploration. First, a robust 2D-3D point cloud registration pipeline is proposed to align LiDAR observations with prior maps. The registration pipeline combines a GeoContext descriptor for single-frame candidate retrieval, a multi-frame verification mechanism for coarse transformation estimation with outlier rejection, and a Scale-ICP algorithm for refinement. The registration module can handle map discrepancies and provide multiple hypotheses when geometric ambiguities arise. To effectively utilize the registration results for exploration planning, we further develop a hierarchical viewpoint planning strategy under localization uncertainties. The hierarchical strategy first spatially attaches local viewpoints to prior guidepoints and adopts a Monte Carlo Tree Search solver to determine their traversal sequence under each registration hypothesis. To mitigate registration uncertainty, a risk-aware selector evaluates prior sequences using confidence-weighted travel risk, and a fixed-endpoint traveling salesman problem is formulated to generate an efficient local coverage path under the selected prior guidance. Benchmark evaluations reveal up to 34.2% improvement in exploration efficiency and 37.9% reduction in flight distance compared to state-of-the-art methods, while extensive simulations and field experiments further demonstrate robustness to prior map incompleteness and deformations.
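
The fixed-endpoint traveling salesman problem named in the abstract asks for the shortest path that starts at one given point, visits every local viewpoint once, and ends at another given point. The sketch below makes that problem statement concrete with a brute-force search over visiting orders. It is an illustration only: the paper does not say which solver it uses, the 2D coordinates are made up, and exhaustive search is practical only for a handful of viewpoints.

```python
from itertools import permutations
from math import dist

def fixed_endpoint_tsp(start, end, viewpoints):
    """Brute-force the shortest path start -> all viewpoints -> end."""
    best_order, best_length = (), float("inf")
    for order in permutations(viewpoints):
        path = (start, *order, end)
        # Sum of Euclidean segment lengths along the candidate path.
        length = sum(dist(a, b) for a, b in zip(path, path[1:]))
        if length < best_length:
            best_order, best_length = order, length
    return list(best_order), best_length

# Hypothetical local viewpoints between two prior guidepoints on a line.
order, length = fixed_endpoint_tsp((0, 0), (4, 0), [(3, 0), (1, 0), (2, 0)])
```

Fixing both endpoints is what lets a local coverage path be stitched into a longer global sequence: the exit point of one local tour is the entry point of the next.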

2606.11704 2026-06-11 cs.RO New submission

Improving Human Diving Endurance with a Field-Deployable, Untethered Exoskeleton

利用可现场部署的无系留外骨骼提高人类潜水耐力

Zhihao Zhou, Zhenmeng Ju, Rui Yang, Chenxi Zhang, Zhihao Zhou, Ming Xu, Enhao Zheng, Dongjie Jiang, Lecheng Ruan, Jingeng Mai, Qining Wang

Affiliations: Institute for Artificial Intelligence, Peking University; Beijing Engineering Research Center of Intelligent Rehabilitation Engineering; School of Advanced Manufacturing and Robotics, Peking University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Department of Sports Medicine, Peking University Third Hospital; School of Rehabilitation Sciences and Engineering, University of Health and Rehabilitation Sciences

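AI summary: Presents the DiveMate exoskeleton, which uses adaptive kick assistance in real-world underwater environments to increase the travel distance achieved on a given amount of breathing gas by 42.9%, extend dive duration by 54.9%, and reduce the net gas consumption rate by 47.0%, substantially improving human diving endurance.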

AI Chinese abstract

人类在水下运动中的耐力从根本上受到高能量需求(克服阻力)和自持呼吸气体有限供应的限制。虽然外骨骼技术可以降低人类在陆地运动中的代谢成本,但其在增强水下潜水耐力方面的潜力尚未被探索。本文介绍了DiveMate,一种可现场部署的无系留外骨骼,旨在通过自适应踢腿辅助在真实水下环境中提高人类潜水耐力。在自然潜水过程中,DiveMate通过降低耗气率,使给定能量(呼吸气体)下的行进距离增加42.9%,潜水时长延长54.9%。肌肉激活的显著减少表明生理消耗降低,净耗气率降低47.0%。运动学特征和规律性的改善进一步支撑了高效的能量经济性。这些结果表明,应用外骨骼辅助有利于提高人类潜水耐力,增强其探索水下世界的能力。本研究拓展了外骨骼的应用前沿,并为未来水下辅助设备的设计和评估提供了潜在参考。

English abstract

Human endurance in underwater locomotion is fundamentally restricted by high energetic demands to overcome drag and the finite supply of self-contained breathing gas. While exoskeleton technology can reduce the metabolic cost of humans in terrestrial locomotion, its potential to enhance human endurance during underwater diving remains entirely unexplored. Here, we present DiveMate, a field-deployable, untethered exoskeleton designed to improve human diving endurance via adaptive kick assistance in real-world underwater environments. During naturalistic diving, DiveMate increases the travel distance using a given energy (breathing gas) by 42.9% and extends dive duration by 54.9% through reducing gas consumption rate. Marked reductions in muscle activation indicate a decrease in physiological exertion, with the net gas consumption rate decreasing by 47.0%. Kinematic characteristics and regularity improvements further underpin efficient energy economy. These results suggest that applying exoskeleton assistance is beneficial for improving human diving endurance and augmenting their ability to explore the aquatic world. This study extends the application frontier of exoskeletons and provides a potential reference for the design and assessment of future underwater assistive devices.

2606.11702 2026-06-11 cs.CV cs.AI cs.CL New submission

MedCTA: A Benchmark for Clinical Tool Agents

MedCTA: 临床工具智能体基准

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

Affiliations: King Abdullah University of Science and Technology (KAUST); Massachusetts Institute of Technology (MIT)

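AI summary: Introduces the MedCTA benchmark, which uses realistic multimodal clinical inputs such as radiology images, pathology slides, and reports to evaluate how well medical AI agents plan and execute tool retrieval, evidence acquisition, and integration.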

Comments Project Page: https://ivul-kaust.github.io/MedCTA/ Code: https://github.com/IVUL-KAUST/MedCTA Data: https://huggingface.co/datasets/IVUL-KAUST/MedCTA

AI Chinese abstract

为了做出临床合理的决策,医疗AI智能体需要超越简单的识别,具备工具检索、证据获取和集成能力。现有基准主要评估孤立的感知或单轮问答,因此对规划、工具调用和部署可靠性的失败可见性有限。我们提出了MedCTA,一个用于评估医疗工具智能体的基准,基于临床验证的、步骤隐含的任务,这些任务基于真实的多模态临床输入,包括放射影像、病理切片和报告。MedCTA包含107个真实临床任务,具有临床医生验证的、在5个部署工具上的可执行轨迹,并支持对工具选择、参数有效性、执行稳定性、轨迹保真度和结果质量的过程感知评估。我们对18个开源和闭源多模态模型进行了基准测试,发现即使是最先进的系统在多步骤临床工具使用中仍然脆弱:自主部署主要由协议失败、过早停止和错误工具调用主导,而黄金标准工具路由带来了巨大但仍不完整的改进。这些结果表明,强大的骨干感知能力并不能转化为临床环境中可靠的智能体行为。MedCTA为审计、诊断和推进可信赖的医疗AI智能体提供了一个严格的测试平台。数据集和评估套件可在 https://ivul-kaust.github.io/MedCTA/ 获取。

English abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/

2606.11699 2026-06-11 cs.LG New submission

A Data-Centric Framework for Detecting and Correcting Corrupted Labels

一个用于检测和纠正损坏标签的数据中心框架

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

Affiliations: Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam

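AI summary: Proposes the Relabeler framework, which jointly uses local and global relationships among instances to detect corrupted labels and corrects them by estimating the most probable clean label from the input features and the observed noisy label, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance across multiple datasets.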

AI Chinese abstract

机器学习和深度学习模型的性能在很大程度上取决于训练数据的质量。然而,现实世界数据集的质量常常因噪声标签而受损,这会显著降低模型的准确性和可靠性。为了解决这一挑战,我们提出了Relabeler,一个端到端的数据中心框架,用于检测和纠正损坏的标签。对于损坏标签检测,Relabeler联合利用数据实例之间的局部和全局关系来识别潜在的噪声样本。在检测到可疑实例后,Relabeler进一步通过基于每个实例的输入特征和观察到的噪声标签估计最可能的干净标签来执行标签纠正。跨多个数据集、噪声类型和噪声率的大量实验表明,Relabeler始终优于最先进的基线,在标签纠正精度上实现了高达58%的提升,在下游任务性能上实现了6%的提升。

English abstract

The performance of machine learning and deep learning models largely depends on the quality of the training data. However, the quality of the real-world datasets is often compromised by noisy labels, which can substantially degrade model accuracy and reliability. To address this challenge, we propose Relabeler, an end-to-end data-centric framework for detecting and correcting corrupted labels. For corrupted label detection, Relabeler jointly leverages both local and global relationships among data instances to identify potentially noisy samples. After detecting suspicious instances, Relabeler further performs label correction by estimating the most probable clean label for each instance based on both its input features and observed noisy label. Extensive experiments across multiple datasets, noise types, and noise rates demonstrate that Relabeler consistently outperforms state-of-the-art baselines, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance.

2606.11695 2026-06-11 cs.LG cs.AI New submission

Noise-Aware Framework for Correcting Corrupted Labels

噪声感知框架用于纠正损坏标签

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Phong Lam, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

Affiliations: Faculty of Information Technology, VNU University of Engineering and Technology

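AI summary: Proposes the CANOLA framework, which corrects corrupted labels through noise-aware learning and iterative label refinement, achieving relative improvements of 19% to 52% in error reduction over existing methods on six datasets.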

AI Chinese abstract

高质量的标注数据对于训练可靠的ML/DL模型至关重要。然而,现实世界的数据集通常包含相当比例的损坏标签,这会严重降低模型性能。为了解决这个问题,我们提出了CANOLA,一种通过噪声感知学习和迭代标签精炼来纠正损坏标签的新框架。CANOLA明确估计数据集的潜在噪声分布,并将此信息纳入噪声感知深度神经网络的训练中。通过在训练过程中融入噪声特征,CANOLA使模型能够降低不可靠监督信号的权重,并专注于可信模式,从而提高鲁棒性和泛化能力。标签纠正是通过谨慎的迭代软标签精炼进行的,其中模型预测与观察到的标签混合,以防止过早或错误的更新。这种渐进式精炼使得数据集能够以稳定且可控的方式得到修复。我们在六个广泛使用的数据集上,在现实噪声标注场景下评估了CANOLA。实验结果表明,CANOLA始终优于最先进的标签纠正方法,在错误减少方面实现了19%到52%的相对改进。此外,在由CANOLA纠正的数据集上训练的模型获得了显著的下游性能提升。即使在CANOLA纠正的数据上训练的简单分类器,其性能也能超过复杂的以模型为中心的方法,最高可达67%。

English abstract

High-quality labeled data is essential for training reliable ML/DL models. However, real-world datasets often contain a considerable proportion of corrupted labels, which can severely degrade model performance. To address this problem, we propose CANOLA, a novel framework for correcting corrupted labels through noise-aware learning and iterative label refinement. CANOLA explicitly estimates the underlying noise distribution of the dataset and incorporates this information into the training of a noise-aware Deep Neural Network. By incorporating noise characteristics during learning, CANOLA enables the model to down-weight unreliable supervision signals and focus on trustworthy patterns, thereby improving robustness and generalization. Label correction is performed via cautious, iterative soft label refinement, in which model predictions are blended with observed labels to prevent premature or erroneous updates. This progressive refinement allows the dataset to be repaired in a stable and controlled manner. We evaluate CANOLA on six widely used datasets under realistic noisy labeling scenarios. Experimental results show that CANOLA consistently outperforms SOTA label correction methods, achieving relative improvements ranging from 19% to 52% in error reduction. Moreover, models trained on datasets corrected by CANOLA obtain substantial downstream performance gains. Even simple classifiers trained on CANOLA's corrected data can outperform complex model-centric approaches by margins of up to 67%.
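
The "cautious, iterative soft label refinement" described above, in which model predictions are blended with observed labels, can be sketched as a convex combination applied repeatedly. This is a minimal illustration under assumptions: the blend coefficient, its value of 0.2, and the function name are hypothetical, and the abstract does not give CANOLA's actual update rule.

```python
def refine_soft_labels(current, predicted, blend=0.2):
    """Move the current soft label a small step toward the model prediction.

    `blend` is a hypothetical mixing coefficient. Small values keep each
    update cautious, so one wrong prediction cannot flip a label outright.
    """
    mixed = [(1.0 - blend) * c + blend * p for c, p in zip(current, predicted)]
    total = sum(mixed)
    return [m / total for m in mixed]   # renormalize to a distribution

observed = [1.0, 0.0, 0.0]        # observed, possibly corrupted, one-hot label
model_probs = [0.1, 0.8, 0.1]     # model prediction for the same instance

after_one_step = refine_soft_labels(observed, model_probs)

# In practice the model would be retrained between rounds; here the
# prediction is held fixed to show the label drifting gradually.
labels = observed
for _ in range(10):
    labels = refine_soft_labels(labels, model_probs)
```

After a single step the observed class still dominates, and only sustained disagreement from the model moves the label to another class, which matches the abstract's stated goal of preventing premature or erroneous updates.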

2606.11691 2026-06-11 cs.LG physics.flu-dyn New submission

Spectrally Regularized Latent Flow Matching for Turbulence Generation

谱正则化潜流匹配用于湍流生成

Khalid Rafiq, Aditya G. Nair

Affiliations: Department of Mechanical Engineering, University of Nevada, Reno

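AI summary: Targets the tendency of latent diffusion and flow matching models to under-represent dissipation-range amplitudes in turbulence generation. Proposes a spectrally regularized latent flow matching framework whose zone-weighted log-spectral objective raises deep-dissipation retained spectral power from 25% to 94% in reconstruction and substantially improves the sampling cost-fidelity tradeoff.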

Comments Accepted at the AI4Physics Workshop at ICML 2026. OpenReview: https://openreview.net/forum?id=MEZ1otYgXS

AI Chinese abstract

潜扩散和流匹配已成为合成湍流生成的主要方法,但它们系统性地低估了耗散范围的振幅。我们引入了一个潜流匹配框架,其中包含一个直接针对此失效模式的谱正则化压缩阶段。在Re_f ≈ 2250的256^2 DNS数据集上,将MSE训练的VAE替换为区域加权对数谱目标,在重建中将深度耗散保留谱功率从25%提升至94%,在无条件生成中从20%提升至79%。改进的潜表示还产生了显著更好的采样成本-保真度权衡:MSE训练的潜空间在DD偏差-0.70附近施加了一个基本质量上限,任何积分器或步数都无法克服,而谱正则化的潜空间在仅20次函数评估时就达到了DD偏差-0.117。从机制上讲,编码器-解码器交换实验表明,改进主要由编码器诱导的潜重组驱动,而非解码器容量;而支持-振幅分解揭示,MSE训练的模型表现为保守抑制模型,通过衰减间歇性高波数结构来最小化逐点误差。两种管道都恢复了二阶结构函数和S_3的正确符号,表明在没有显式监督的情况下正确的级联方向。S_3幅度上的一个小残余差距表明,相位相干三元组组织仍然是未来生成湍流模型中振幅保真度的补充轴。

English abstract

Latent diffusion and flow matching have emerged as leading approaches for synthetic turbulence generation, yet they systematically under-represent dissipation-range amplitudes. We introduce a latent flow matching framework with a spectrally regularized compression stage that directly targets this failure mode. On a $256^2$ DNS dataset at $Re_f \approx 2250$, replacing an MSE-trained VAE with a zone-weighted log-spectral objective raises deep-dissipation retained spectral power from 25% to 94% in reconstruction and from 20% to 79% in unconditional generation. The improved latent representation also yields a substantially better sampling cost-fidelity tradeoff: the MSE-trained latent space imposes a fundamental quality ceiling near DD bias -0.70 that no integrator or step-count can overcome, while the spectrally regularized latent space reaches DD bias -0.117 at just 20 function evaluations. Mechanistically, encoder-decoder swap experiments show that the improvement is driven primarily by encoder-induced latent reorganization rather than decoder capacity, while a support-amplitude decomposition reveals that MSE-trained models behave as conservative suppression models, minimizing pointwise error by attenuating intermittent high-wavenumber structure. Both pipelines recover the second-order structure function and the correct sign of $S_3$, indicating the correct cascade direction without explicit supervision. A small residual gap in the magnitude of $S_3$ suggests that phase-coherent triadic organization remains a complementary axis to amplitude fidelity for future generative turbulence models.

2606.11689 2026-06-11 cs.CV New submission

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

RankVR: 低秩结构感知与价值重新校准用于鲁棒组合图像检索

Jiale Huang, Zixu Li, Zhiheng Fu, Zhiwei Chen, Qinlei Huang, Yupeng Hu

Affiliations: Shandong University

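AI summary: To address noisy triplet correspondence in composed image retrieval, proposes the RankVR framework. A Global Structure Consistency Perception module uses the effective rank of the correlation matrix to decouple clean samples, and an Adaptive Semantic Value Calibration module dynamically quantifies the value of each triplet. RankVR significantly outperforms existing methods on the FashionIQ and CIRR datasets.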

Comments Accepted by ICMR 2026

AI Chinese abstract

组合图像检索(CIR)构成了一种关键范式,要求模型对参考图像和修改文本进行联合推理。然而,大规模数据集中普遍存在的噪声三元组对应(NTC)严重限制了模型性能。现有的去噪方法要么针对二元不匹配,要么依赖基于标量的逐点估计,忽略了样本群体中丰富的全局结构相关性和训练过程中的动态价值变化,从而产生次优结果。本文识别了两个关键未解决的挑战:语义相关性的全局结构不一致性和难样本判别不确定性。为了解决这些问题,我们提出了RankVR,一个通过全局结构一致性和动态价值感知构建鲁棒CIR模型的框架。具体来说,我们引入了全局结构一致性感知(GSCP)模块,该模块利用相关矩阵的有效秩将干净样本从结构噪声中解耦。通过测量秩差异,GSCP识别出破坏宏观语义对称性的样本。此外,我们开发了自适应语义价值校准(ASVC)模块,以区分高价值的难干净样本。通过整合训练潜力和可靠性,它动态量化每个三元组的语义价值,确保有效利用难样本,同时抑制以逻辑冲突为特征的噪声。在FashionIQ和CIRR基准数据集上的大量实验表明,RankVR显著优于现有最先进方法,验证了其在噪声环境中的卓越鲁棒性。

English abstract

Composed Image Retrieval (CIR) constitutes a pivotal paradigm requiring models to perform joint reasoning on reference images and modification texts. However, the prevalence of Noisy Triplet Correspondence (NTC) in large-scale datasets severely constrains model performance. Existing denoising methods either target binary mismatches or rely on scalar-based point-wise estimation, neglecting rich global structural correlations among sample populations and dynamic value variations during training, thereby yielding suboptimal results. This paper identifies two critical unresolved challenges: Global Structural Inconsistency of Semantic Correlations and Hard Sample Discrimination Uncertainty. To address these, we propose RankVR, a framework designed to construct a robust CIR model via global structure consistency and dynamic value perception. Specifically, we introduce the Global Structure Consistency Perception (GSCP) module, which utilizes the Effective Rank of the Correlation Matrix to decouple clean samples from structural noise. By measuring rank difference, GSCP identifies samples disrupting macroscopic semantic symmetry. Furthermore, we develop the Adaptive Semantic Value Calibration (ASVC) module to distinguish high-value hard clean samples. By integrating training potential and reliability, it dynamically quantifies the semantic value of each triplet, ensuring effective utilization of hard samples while suppressing noise characterized by logical conflicts. Extensive experiments on the FashionIQ and CIRR benchmark datasets demonstrate that RankVR significantly outperforms existing state-of-the-art methods, validating its superior robustness in noisy environments.
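
The "Effective Rank of the Correlation Matrix" has a standard definition that is easy to compute once the singular values are known: the exponential of the Shannon entropy of the normalized singular values. The sketch below assumes that standard definition and takes the singular values as input. Whether RankVR uses exactly this form, and how it turns a rank difference into a per-sample noise score, is not stated in the abstract.

```python
import math

def effective_rank(singular_values):
    """Effective rank = exp(Shannon entropy of the normalized singular values)."""
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Hypothetical spectra: one dominant direction versus evenly spread energy.
low_rank = effective_rank([5.0, 0.01, 0.01, 0.01])
full_rank = effective_rank([1.0, 1.0, 1.0, 1.0])
```

Unlike the ordinary matrix rank, this quantity is continuous, so adding or removing a single sample can shift it by a small measurable amount. That is the property a rank-difference test would rely on.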

2606.11688 2026-06-11 cs.CL cs.AI New submission

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Goal-Autopilot: 一种可验证的防伪造防火墙,用于无人值守的长周期智能体

Youwang Deng

Affiliations: EpistemicaLab, Independent Research

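AI summary: Proposes the Autopilot execution model, which externalizes state into a finite-state machine and enforces gate verification so that an agent cannot falsely claim success. Across a 3,150-cell paired corpus the fabrication rate falls to 0.95%, well below the baseline methods.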

Comments Preprint. Code: https://github.com/EpistemicaLab/goal-compiled-autopilot

AI Chinese abstract

长周期LLM智能体在无人值守时不可信:没有人类监控,它们自信地报告从未验证的成功。我们将诚实性——限制智能体在终止时可能声称的内容——视为无人值守自主性的首要指标,与能力区分开来。我们提出Autopilot,一种执行模型,使得静默伪造的成功在结构上不可能,而不仅仅是更罕见。Autopilot将所有工作状态外部化到一个持久的、门控的有限状态机中,调度器每次以无状态滴答推进;一个硬性下限禁止任何终端“完成”声明,其可伪造的门并未实际执行并通过。我们证明了一个无假成功定理——在门控正确性、下限执行和计划覆盖下,终止意味着目标成立——其唯一信任点可经验测量,并表明最坏情况退化为诚实的停顿,而非伪造的成功。由于每个滴答仅重新水化状态机,每步上下文成本在时间范围内恒定。在3,150个单元的配对语料库(70个任务×3个系统×3个模型×5个种子,包括跨11个开源仓库的50个SWE-bench Lite任务)上,Autopilot在0.95%的单元上伪造[95% CI 0.38–1.62],而Reflexion和StateFlow基线分别在8.10% [6.48–9.81]和25.05% [22.48–27.62]上伪造。主要对比存在于困难场景:在SWE-bench Lite上,防火墙将伪造率从33.7%(StateFlow)降至0.67%,配对差异为-33.07个百分点[95% CI -36.53, -29.73]。机制在于门控而非模型:所有十个Autopilot伪造均来自最强模型,而两个较弱的中间模型在700个配对单元中从未伪造。防火墙设计上以覆盖换取诚实——诚实的停顿是可恢复的;而自信的错误输出向下游发送则不可恢复。

English abstract

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
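
The core mechanism described above, a gated finite-state machine with a hard floor on "done" claims, can be sketched in a few lines. This is an illustrative toy, not the Autopilot implementation: the class names, the status strings, and the lambda gates are all hypothetical, and a real gate would run something falsifiable such as a test suite.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    gate: Callable[[], bool]      # falsifiable check for this step
    gate_passed: bool = False     # set only by actually executing the gate

@dataclass
class GatedMachine:
    steps: list[Step]
    cursor: int = 0               # all progress lives here, not in the caller

    def tick(self) -> str:
        """Advance by at most one step and report the resulting status."""
        if self.cursor >= len(self.steps):
            return self.status()
        step = self.steps[self.cursor]
        step.gate_passed = bool(step.gate())
        if not step.gate_passed:
            return "stalled"      # honest stall instead of a success claim
        self.cursor += 1
        return self.status()

    def status(self) -> str:
        # Hard floor: "done" requires every gate to have executed and passed.
        finished = self.cursor == len(self.steps)
        if finished and all(s.gate_passed for s in self.steps):
            return "done"
        return "in_progress"

ok = GatedMachine([Step("write patch", lambda: True), Step("run tests", lambda: True)])
results_ok = [ok.tick(), ok.tick()]

bad = GatedMachine([Step("write patch", lambda: True), Step("run tests", lambda: False)])
results_bad = [bad.tick(), bad.tick(), bad.tick()]
```

In this toy the only way to reach "done" is through gates that ran and returned true, so a failing check can produce repeated stalls but never a success report. The guarantee is only as strong as the gates themselves, which corresponds to the gate-soundness assumption in the abstract.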

2606.11687 2026-06-11 cs.CV cs.LG cs.RO New submission

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

DroneShield-AI:一种用于受争议空域中实时自主无人机威胁检测、行为意图分类和群体智能的多模态传感器融合框架

Marius Bayizere

Affiliations: Independent Researcher

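AI summary: Proposes the DroneShield-AI framework, which integrates six processing layers including RF signal classification, acoustic detection, and YOLOv8-based visual detection. A Behavioral Intent Classification Engine (BICE) provides six-class threat classification with a 30-second advance warning, and a Graph Neural Network Swarm Intelligence Module (GNN-SIM) analyzes multi-drone formations. The system reaches 96.1% detection accuracy and 142 ms latency on low-cost hardware.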

Comments 23 pages, 6 figures, 11 tables. Code available at https://github.com/bayizeremarius/DroneShield-AI

AI Chinese abstract

无人机(UAV)威胁已成为21世纪定义性的安全挑战。本文提出DroneShield-AI,一个统一的开放框架,集成了六个处理层:RF信号分类、声学电机特征检测、基于YOLOv8的视觉检测、证据加权传感器融合、行为意图分类引擎(BICE)和图神经网络群体智能模块(GNN-SIM)。BICE首次引入了针对无人机飞行模式的系统性六类威胁分类法,能够提前30秒发出预测性操作员警报。GNN-SIM是首个用于对抗性多无人机编队分析的开放框架,采用图注意力网络。在三个公开的真实世界数据集上评估,融合流水线在约500-780美元总系统成本的商用CPU级硬件上实现了96.1%的检测准确率、3.2%的误报率、AUC-ROC:0.981以及142ms的端到端延迟。所有代码、模型权重和仿真数据集在提交时公开发布。

English abstract

Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, a 3.2% false alarm rate, an AUC-ROC of 0.981, and 142 ms end-to-end latency on commodity CPU-class hardware at a total system cost of approximately 500 to 780 USD. All code, model weights, and simulation datasets are publicly released at submission.

2606.11686 2026-06-11 cs.CL cs.AI New submission

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

层隔离评估:使用无LLM、回归锁定的测试工具对生产级LLM代理的确定性框架进行门控

Sawyer Zhang, Alexander Wang, Sophie Lei

Affiliations: Lumivate (Lumi)

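AI summary: Proposes layer-isolated evaluation, which decomposes an LLM agent into a fixed taxonomy of layers and uses a deterministic, no-LLM test suite to detect regressions layer by layer, showing that aggregate metrics mask local degradations that per-slice baseline-locked gates localize.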

Comments 12 pages, 2 figures, 5 tables

详情
AI中文摘要

端到端任务成功是评估LLM代理的主要方式,但一个聚合数字只能告诉你代理发生了回归,却无法指出具体位置。我们提出层隔离评估:将一个部署的订单代理分解为固定的层次分类(本体、意图、路由、分解、升级、安全、记忆以及跨领域的封装/防御),每一层由其在确定性、无LLM“纯”模式下的断言切片独立测试。纯测试套件(23个切片共238个案例;225个在2.39秒内运行,约10毫秒/案例)在每次变更时针对锁定的逐切片基线在CI中运行。我们通过受控回归注入进行验证,一次退化一个非安全层(共七个层)。我们未设计的效果是掩蔽:聚合通过率几乎不变(六个局部回归的变化范围为-1.7至-5.9个百分点),而匹配的切片则大幅下降(-25至-91个百分点)。一个层的切片对其自身故障做出反应部分是由构造决定的;测量结果是(i)聚合掩蔽以及(ii)损伤不会扩散到其他切片:注入层的切片在7个案例中的5个中是受影响最严重的,在7个案例中的7个中位列前三(平均排名1.29/19)。定位在第二个结构不同的租户(星巴克新加坡)上复现:所有七个匹配切片均大幅下降,因此这不是单一目录的伪像。我们将其定位为EDDOps规定但未实现的组件级评估的具体确定性实例,以CheckList为前身,并作为全工作流随机突变测试的确定性镜像。我们的贡献:(a)为生产代理提供了一个完全分解的、亚秒级、无LLM的逐层测试工具,(b)一个覆盖诚实性测试充分性标准,拒绝为未执行的层打分,以及(c)回归注入演示,证明逐切片基线锁定可以定位聚合指标掩盖的回归。

English abstract

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.
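
The masking effect reported above follows from simple arithmetic: a regression confined to a small slice barely moves a suite-wide pass rate, while the slice's own pass rate drops sharply. The sketch below shows a per-slice gate against a locked baseline. The slice names, case counts, and the zero-tolerance default are hypothetical and are not taken from the paper's harness.

```python
def pass_rate(results):
    """Fraction of passing cases in a list of booleans."""
    return sum(results) / len(results)

def regressed_slices(baseline, current, tolerance=0.0):
    """Return the names of slices whose pass rate fell below the locked baseline."""
    return sorted(
        name for name, results in current.items()
        if pass_rate(results) < baseline[name] - tolerance
    )

# Hypothetical suite: 3 slices, 100 cases, everything passing at baseline.
baseline = {"routing": 1.0, "memory": 1.0, "intent": 1.0}
current = {
    "routing": [True] * 40,
    "memory": [True] * 5 + [False] * 5,   # local regression in a 10-case slice
    "intent": [True] * 50,
}

all_results = [r for results in current.values() for r in results]
aggregate = pass_rate(all_results)            # suite-wide pass rate
flagged = regressed_slices(baseline, current)
```

Here the aggregate falls by only 5 percentage points while the affected slice falls by 50, and the gate names the slice directly. The same contrast appears in the abstract's figures of -1.7 to -5.9 pp for the aggregate against -25 to -91 pp for the matching slice.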

2606.11683 2026-06-11 cs.CV cs.AI New submission

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

推理,再推理:跨视角重访提升空间推理

Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao

Affiliations: University of Science and Technology of China

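AI summary: Proposes the ReRe framework, in which an MLLM first reasons from the original video and then verifies or revises its hypothesis against a generated, complementary novel-view video, substantially improving spatial reasoning without any training.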

Comments ICML 2026

AI Chinese abstract

从自我中心视频进行空间推理本质上是具有挑战性的,因为可观察的证据受到相机轨迹的限制。现有方法依赖单轮推理,迫使模型通过语义先验而非可验证证据来解决几何歧义。我们认为空间推理应该是可重访的:在有限证据下形成的结论在获得互补视角时应保持开放以进行修正。基于这一见解,我们提出“推理,再推理”(ReRe),一种无需训练、推理时框架,包含两个阶段:在推理阶段,MLLM从原始视频形成空间假设;在再推理阶段,它通过观察合成的新视角视频来验证或修正假设。为了实现有效的跨视角重访,我们设计了一个几何到视频的流水线,从预测的3D几何中渲染出策略性互补的新视角。这些视角具有升高的、倾斜的视角,覆盖整个场景,同时保持MLLM的原生视频接口,无需架构修改。在VSI-Bench和STI-Bench上的广泛评估表明,ReRe显著提升开源MLLM,使其与专有最先进性能相媲美。项目页面:https://zhenjiemao.github.io/ReRe/

English abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/

2606.11682 2026-06-11 cs.CV cs.LG New submission

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

Jiaqi Luo

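Affiliations * School of Mathematical Sciences, Soochow University

AI summary Proposes TI-Adapter, which freezes the tabular encoder and adds an adapter after it, and adapts the image branch with embedding-level and bottleneck-level adapters; on 20 datasets it matches or exceeds full fine-tuning with fewer trainable parameters.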

Abstract

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.
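
To make the idea of learning an adapter after a frozen encoder concrete, here is a minimal pure-Python sketch. It assumes a generic residual bottleneck adapter (down-projection, ReLU, up-projection, skip connection) with a zero-initialized up-projection; the abstract does not specify TI-Adapter's exact layer design, so every name and shape below is hypothetical.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def frozen_tabular_encoder(features):
    """Stand-in for a pretrained encoder whose weights are never updated."""
    return [float(f) for f in features]


class BottleneckAdapter:
    """Residual adapter: h + W_up * relu(W_down * h). Illustrative only."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection gets small random weights.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection starts at zero (an assumption), so the adapter is an
        # identity map before any training step.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.down, h)]  # ReLU bottleneck
        delta = matvec(self.up, z)
        return [a + b for a, b in zip(h, delta)]         # skip connection

    def num_params(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


adapter = BottleneckAdapter(dim=4, bottleneck=2)
embedding = frozen_tabular_encoder([1, -2, 0.5, 3])
adapted = adapter(embedding)  # equals the embedding until the adapter is trained
```

Only the adapter's `2 * dim * bottleneck` weights would be trained in this sketch, which is the sense in which adapter tuning uses fewer trainable parameters than updating the encoder itself.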

2606.11680 2026-06-11 cs.AI cs.CL cs.LG New submission

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

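Affiliations * Duke University; Snowflake AI Research

AI summary Proposes HORMA, which organizes memory into a file-system-like hierarchy and retrieves from it with a lightweight navigation agent trained by reinforcement learning, improving performance on long-horizon tasks while reducing token usage.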

Abstract

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.
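
The file-system-like memory described above can be pictured as a tree whose leaves link summaries to raw trajectories. The sketch below is an illustration under stated assumptions: the node layout, the example episodes, and especially the keyword-overlap scoring are hypothetical. HORMA's navigator is a lightweight agent trained with reinforcement learning, which this greedy heuristic only stands in for.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MemoryNode:
    """A directory-like node; leaves point at a raw trajectory."""
    name: str
    summary: str
    children: List["MemoryNode"] = field(default_factory=list)
    raw_trajectory: Optional[str] = None


def overlap(query: str, text: str) -> int:
    """Count shared lowercase words (a crude stand-in for a learned policy)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(root: MemoryNode, query: str) -> Tuple[List[str], Optional[str]]:
    """Walk from the root to one leaf, picking the best-matching child."""
    node, path = root, [root.name]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
        path.append(node.name)
    return path, node.raw_trajectory


# Invented example memory, not data from the paper.
root = MemoryNode("root", "all episodes", [
    MemoryNode("kitchen", "kitchen tasks with mug and fridge", [
        MemoryNode("ep1", "cooled the mug in the fridge",
                   raw_trajectory="open fridge -> put mug -> close fridge"),
        MemoryNode("ep2", "heated the egg in the microwave",
                   raw_trajectory="open microwave -> put egg -> start"),
    ]),
    MemoryNode("bedroom", "bedroom tasks with lamp and book", [
        MemoryNode("ep3", "read the book under the lamp",
                   raw_trajectory="take book -> turn on lamp"),
    ]),
])

path, trajectory = navigate(root, "cool the mug")
```

Only the selected leaf's trajectory is handed back, so the context passed to the agent stays small while the detailed record remains reachable through the summary that links to it.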

2606.11678 2026-06-11 cs.CL New submission

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Yijie Deng, He Zhu, Wen Wang, Junyou Su, Minxin Chen, Wenjia Zhang

Affiliations * School of Architecture and Urban Planning, Shenzhen University; Shenzhen Key Laboratory of Urban Spatial Information and Intelligent Modeling; Department of Urban Planning and Design, The University of Hong Kong

AI summary Proposes UPBench, which evaluates 25 LLMs on a 4x5 matrix of knowledge pillars and cognitive levels; models do better on analytical tasks than on factual recall and integrative judgment, showing how strongly planning knowledge depends on institutional context.

Abstract

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.
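
A "non-monotonic cognitive curve" over a 4x5 matrix can be checked by averaging scores per cognitive level and testing whether the averages only rise or only fall. The sketch below uses invented toy scores and placeholder pillar and level names, since the abstract gives neither; it is not data from UPBench.

```python
# Placeholder labels: the abstract does not name the pillars or levels.
PILLARS = ["pillar_1", "pillar_2", "pillar_3", "pillar_4"]
LEVELS = ["level_1", "level_2", "level_3", "level_4", "level_5"]

# Invented toy scores (0-100), rows = pillars, columns = cognitive levels.
TOY_SCORES = [
    [50, 60, 70, 80, 60],
    [40, 60, 70, 80, 50],
    [60, 60, 70, 80, 70],
    [50, 60, 70, 80, 60],
]


def level_means(scores):
    """Average each cognitive level (column) across all pillars (rows)."""
    n_pillars = len(scores)
    return [sum(row[i] for row in scores) / n_pillars
            for i in range(len(scores[0]))]


def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


means = level_means(TOY_SCORES)
curve_is_non_monotonic = not is_monotonic(means)
```

In the toy matrix the averages climb through the fourth level and then drop at the fifth, which is the kind of shape the abstract describes: stronger results on analytical tasks than on factual recall or integrative judgment.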