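arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.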

2606.11779 2026-06-11 cs.CV New submission

Battery detection of XRay images using transfer learning

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

Affiliation: Ruhr West University of Applied Sciences

AI summary: This study uses transfer learning with a YOLOv5m model to detect batteries in X-ray images and classify three types of lithium-ion batteries, reaching 94% detection accuracy with a 22 ms inference time.

Comments Published at the European Symposium on Artificial Neural Networks (ESANN 2022)

2606.11770 2026-06-11 cs.AI New submission

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky

Affiliation: School of Computing and Information Systems, The University of Melbourne

AI summary: Proposes SVoT, a framework that uses reinforcement learning to generate verifiable intermediate states and visualizations, combining textual and visual reasoning chains to make multimodal large language models more reliable at multi-hop spatial reasoning.

Abstract

Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.
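The group-relative part of GRPO, which the abstract names as the training method, can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each sampled response's reward is normalised by the mean and standard deviation of its group. The reward values and the function name `group_relative_advantages` are hypothetical, and SVoT's actual fine-grained reward design is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled responses.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps), the
    normalisation commonly used with GRPO. Not taken from the SVoT paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to provide a baseline.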

2606.11769 2026-06-11 cs.AI cs.LG New submission

When Do Data-Driven Systems Exhibit the Capability to Infer?

Maximilian Poretschkin, Tabea Naeven

Affiliations: Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS); University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary: Addresses the vaguely defined "capability to infer" in the EU AI Act by proposing a grading framework grounded in statistical learning theory, and uses credit-scoring case studies to show how to judge whether a system has that capability.

Abstract

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at https://github.com/fraunhofer-iais/inference-framework-creditscorecards.

2606.11762 2026-06-11 cs.CL cs.AI New submission

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

Affiliations: Raffles Institution; College of Computing and Data Science, Nanyang Technological University; Lee Kong Chian School of Medicine, Nanyang Technological University; Centre of AI in Medicine (C-AIM), Nanyang Technological University

AI summary: Proposes a domain-agnostic automated framework that quantifies the divergent and convergent creativity of LLMs on open-ended tasks using semantic entropy and a retrieval-based multi-agent judge, validated in three domains: problem-solving, research ideation, and creative writing.

Comments Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: https://github.com/tanminsen/creativity-eval

Abstract

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
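To make the semantic entropy metric mentioned in the abstract more concrete, the sketch below computes Shannon entropy over clusters of semantically equivalent responses. It assumes the responses have already been grouped into clusters (the clustering step, and the paper's exact formulation, are not reproduced), and the cluster labels and the function name `semantic_entropy` are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the distribution over semantic clusters.

    Assumption: each response has already been assigned to a cluster of
    semantically equivalent responses; how to cluster is out of scope here.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    probs = [c / total for c in counts.values()]
    return sum(p * math.log(1 / p) for p in probs)


# Hypothetical cluster labels for two models' responses to one prompt.
repetitive = ["a", "a", "a", "a"]  # every response means the same thing
diverse = ["a", "b", "c", "d"]     # every response is semantically distinct

print(semantic_entropy(repetitive))
print(semantic_entropy(diverse))
```

Because the score depends only on how responses distribute over meanings, it needs no reference answers, which is consistent with the abstract's description of the metric as reference-free.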

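2606.11761 2026-06-11 cs.LG New submission

RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

Atif Hassan, Swanand Khare, Jiaul H. Paik

Affiliation: IIT Kharagpur

AI summary: Proposes RCAP, an algorithm that uses a closed-form solution to estimate the fraction of samples to keep per class, adjusts it adaptively, and samples high-loss examples preferentially; it outperforms existing methods across datasets and training paradigms and, with only 10% of the data, improves performance on class-imbalanced datasets by more than 1% over full-data training.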

Comments Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (UAI 2025)

Journal ref: PMLR, volume 286, pages 1648-1662, 2025

Abstract

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using class-wise aggregated loss. Thereafter, it employs an adaptive sampling strategy that prioritizes samples having high loss for populating the class-wise subsets. We evaluate RCAP on six diverse datasets ranging from class-balanced to highly imbalanced using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10\%$ data, RCAP delivers $>1\%$ improvement in performance on class-imbalanced datasets compared to full data training while providing an average $8.69\times$ speedup. The code can be accessed at https://github.com/atif-hassan/RCAP-dynamic-dataset-pruning
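The two steps described in the abstract (a per-class budget driven by class-wise aggregated loss, then sampling that favours high-loss samples) can be sketched roughly as follows. This is not RCAP itself: the abstract does not spell out the closed-form solution, so a simple loss-proportional rule stands in for it, and all names, losses, and class sizes are hypothetical.

```python
import random


def class_keep_counts(class_losses, class_sizes, keep_fraction):
    """Split a global sample budget across classes in proportion to loss.

    Assumption: a loss-proportional rule standing in for RCAP's closed-form
    solution, which the abstract does not spell out.
    """
    budget = round(keep_fraction * sum(class_sizes.values()))
    total_loss = sum(class_losses.values())
    counts = {}
    for cls, loss in class_losses.items():
        share = round(budget * loss / total_loss)
        # Keep at least one sample per class and never more than it has.
        counts[cls] = max(1, min(class_sizes[cls], share))
    return counts


def sample_high_loss(sample_losses, k, rng):
    """Draw up to k distinct sample indices, favouring high-loss samples."""
    remaining = dict(enumerate(sample_losses))
    chosen = []
    for _ in range(min(k, len(sample_losses))):
        idxs = list(remaining)
        weights = [remaining[i] for i in idxs]
        pick = rng.choices(idxs, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Hypothetical per-class aggregated losses and class sizes for one epoch.
class_losses = {"majority": 0.2, "minority": 0.8}
class_sizes = {"majority": 900, "minority": 100}
counts = class_keep_counts(class_losses, class_sizes, keep_fraction=0.10)
print(counts)

rng = random.Random(0)
kept = sample_high_loss([0.1, 2.0, 0.05, 1.5, 0.3], k=2, rng=rng)
print(kept)
```

In this toy setting the poorly fitted minority class receives most of the 10% budget, which illustrates why a class-aware budget can protect worst-group accuracy at high pruning rates.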

2606.11745 2026-06-11 cs.CV cs.AI New submission

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

Haoping Yu, Yuanxi Li, Jing Ma

Affiliation: University of Science and Technology of China

AI summary: Proposes BridgeVLM, which induces a causal graph from multi-image inputs, converts it into Causal Tokens, and performs causal message passing through layers injected into the LLM decoder, substantially improving multi-image causal reasoning.

Abstract

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: from 33.4% to 75.1%).

2606.11744 2026-06-11 cs.CL cs.AI New submission

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

Sidney Tio, Arunesh Sinha, Pradeep Varakantham

Affiliations: School of Computing and Information Systems, Singapore Management University; Department of Management Science and Information Systems, Rutgers Business School

AI summary: To address LLMs' poor tutoring performance over extended sessions, proposes a system that separates curriculum sequencing, Socratic dialogue, and knowledge-state inference, using a PPO policy to decide the teaching order; it outperforms baselines on STEM and non-STEM topics.

Comments 10 Main Body Pages, with Appendices

Abstract

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once. The tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue: on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
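The prerequisite knowledge graph described in the abstract (subtopics as nodes, dependencies as edges) can be sketched as a plain dictionary. The example below only computes which nodes are currently teachable; it does not model the PPO policy that chooses among them or decides how many turns to spend. The graph contents and the function name `teachable_nodes` are hypothetical.

```python
from graphlib import TopologicalSorter


def teachable_nodes(prerequisites, mastered):
    """Return unmastered subtopics whose prerequisites are all mastered.

    `prerequisites` maps each subtopic (node) to the set of subtopics it
    depends on (edges). A sequencing policy would choose among these nodes.
    """
    return sorted(
        node
        for node, deps in prerequisites.items()
        if node not in mastered and deps <= mastered
    )


# Hypothetical prerequisite graph for a query about solving linear equations.
prerequisites = {
    "arithmetic": set(),
    "variables": {"arithmetic"},
    "balancing equations": {"arithmetic"},
    "linear equations": {"variables", "balancing equations"},
}

# Raises graphlib.CycleError if the dependencies are circular.
order = list(TopologicalSorter(prerequisites).static_order())
print(order)

print(teachable_nodes(prerequisites, mastered=set()))
print(teachable_nodes(prerequisites, mastered={"arithmetic"}))
```

Restricting the choice to teachable nodes keeps the sequencing decision small, which is one plausible reason a lightweight policy can handle it while the LLM focuses on the dialogue.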

2606.11743 2026-06-11 cs.RO cs.GR cs.LG New submission

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

Siyu Ma, Yuqi Liang, Chang Yu, Yunuo Chen, Hao Su, Yixin Zhu, Yin Yang, Chenfanfu Jiang

Affiliations: University of California, Los Angeles; University of California, San Diego; University of Electronic Science and Technology of China; Peking University; University of Utah

AI summary: Proposes TacCoRL, a framework that injects tactile feedback into vision-language-action policies through sim-real co-training and reinforcement learning, raising the average success rate on contact-rich tasks by 22.5 percentage points (72.5% versus a 50.0% baseline).

Abstract

Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to baseline of 50.0%. Result videos and more details are available at https://tac-corl.github.io/

2606.11740 2026-06-11 cs.CV cs.CL New submission

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

Mengzhuo Chen, Yan Shu, Chi Liu, Hongming Piao, Xidong Wang, Derek Li, Bryan Dai

Affiliation: IQuest Research

AI summary: Proposes UniReason-Med, a framework that transfers reasoning ability from 2D medical images to 3D medical VQA through a shared grounded reasoning interface, combining supervised fine-tuning with reinforcement learning to substantially improve 3D reasoning.

Abstract

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at https://github.com/IQuestLab/unireason-med.

2606.11739 2026-06-11 cs.CV cs.AI New submission

Multi-View In-Cabin Monitoring System for Public Transport Vehicles

Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann

Affiliations: Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI)

AI summary: Presents a multi-view in-cabin monitoring dataset with synchronized RGB-D images and LiDAR data, annotated with 3D human poses and bounding boxes, to support the evaluation of multi-view 3D detection models.

Comments Submitted to ICDM2026

2606.11724 2026-06-11 cs.AI New submission

Mind the Perspective: Let's Reason Recursively for Theory of Mind

Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky

Affiliations: School of Computing and Information Systems, The University of Melbourne, Australia; SensiLab, Monash University, Australia

AI summary: Proposes RecToM, a framework that models nested beliefs through recursive perspective construction, reducing higher-order belief questions to actual-world questions and achieving state-of-the-art results on several ToM benchmarks.

Abstract

Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM's perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.
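The recursion described in the abstract, where each perspective is built from the preceding one along the character chain and the final question is answered as an actual-world question, can be illustrated with a toy example. This sketch deliberately uses plain event-visibility filtering, which the abstract says RecToM goes beyond, and it involves no LLM; the story, data layout, and function names are hypothetical.

```python
def construct_perspective(events, character):
    """Keep only the events that `character` witnessed."""
    return [e for e in events if character in e["witnesses"]]


def nested_perspective(events, chain):
    """Build each perspective from the preceding one along the chain."""
    view = events
    for character in chain:
        view = construct_perspective(view, character)
    return view


def location_of(view, obj):
    """Answer an actual-world question inside a constructed perspective."""
    location = None
    for event in view:
        if event["object"] == obj:
            location = event["to"]  # the latest visible move wins
    return location


# Hypothetical Sally-Anne style story: Anne moves the ball after Sally leaves.
events = [
    {"object": "ball", "to": "basket", "witnesses": {"Sally", "Anne"}},
    {"object": "ball", "to": "box", "witnesses": {"Anne"}},
]

# Actual world: where is the ball?
print(location_of(nested_perspective(events, []), "ball"))
# First order: where does Sally think the ball is?
print(location_of(nested_perspective(events, ["Sally"]), "ball"))
# Second order: where does Anne think Sally thinks the ball is?
print(location_of(nested_perspective(events, ["Anne", "Sally"]), "ball"))
```

Once the nested view is built, the higher-order question needs no special handling: it is answered with the same `location_of` lookup used for the actual world.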

2606.11722 2026-06-11 cs.LG cs.AI cs.CL New submission

ICA Lens: Interpreting Language Models Without Training Another Dictionary

Sida Liu, Feijiang Han

Affiliations: Independent Researcher; University of Maryland

AI summary: Proposes ICALens, which uses independent component analysis (ICA) to efficiently extract interpretable directions from language-model representations without training sparse autoencoders, and is competitive on SAEBench.

Comments Ongoing Project

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
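The abstract's intuition, that token-selective directions look less Gaussian than random ones, can be checked on synthetic data. The sketch below uses excess kurtosis, a classical measure of non-Gaussianity; it does not reproduce the ICALens pipeline, and the simulated activations, the 5% firing rate, and the function name `excess_kurtosis` are assumptions made for illustration.

```python
import random


def excess_kurtosis(values):
    """Excess kurtosis: about 0 for Gaussian data, large for heavy tails."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


rng = random.Random(0)
n_tokens = 5000

# Hypothetical projections of token activations onto two directions.
# A "random" direction: roughly Gaussian across tokens.
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]
# A token-selective direction: silent on most tokens, active on about 5%.
selective = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.05 else 0.0
    for _ in range(n_tokens)
]

print(excess_kurtosis(gaussian_like))
print(excess_kurtosis(selective))
```

The selective direction scores far higher than the Gaussian one, which is the kind of signal an ICA objective that seeks non-Gaussian projections can pick up without training a dictionary.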

2606.11719 2026-06-11 cs.CV cs.AI New submission

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He

Affiliations: Peking University; Ant International; The University of Hong Kong

AI summary: Proposes Ouroboros-Spatial, a self-evolving framework in which a proposer and a solver interact in a closed loop to generate training samples matched to the model's current ability, substantially improving Qwen3-VL on six spatial reasoning benchmarks with an order of magnitude fewer training examples.

Abstract

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
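One way to picture the difficulty signal described in the abstract is a filter that sorts samples by the solver's confidence. The sketch below is an assumption-laden illustration, not the paper's procedure: the thresholds `low` and `high`, the sample fields, and the function name `select_by_confidence` are hypothetical, and the proposer and solver models are not modelled.

```python
def select_by_confidence(samples, low=0.2, high=0.9):
    """Bucket samples by solver confidence.

    Assumption: `confidence` is the solver's probability for the
    ground-truth answer; the thresholds `low` and `high` are hypothetical.
    """
    kept, too_easy, too_hard = [], [], []
    for sample in samples:
        confidence = sample["confidence"]
        if confidence > high:
            too_easy.append(sample)   # trivial for the current solver
        elif confidence < low:
            too_hard.append(sample)   # likely ambiguous or out of reach
        else:
            kept.append(sample)       # informative at the current stage
    return kept, too_easy, too_hard


# Hypothetical QA pairs with solver confidences from one iteration.
samples = [
    {"question": "Which object is closer to the door?", "confidence": 0.97},
    {"question": "How many chairs are left of the table?", "confidence": 0.55},
    {"question": "What is the room's area in square metres?", "confidence": 0.08},
]

kept, too_easy, too_hard = select_by_confidence(samples)
print([s["question"] for s in kept])
```

In a closed loop, the sizes of the `too_easy` and `too_hard` buckets could be reported back to the proposer so that the next batch of questions shifts toward the informative middle band.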

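2606.11712 2026-06-11 cs.CL cs.AI cs.LG New submission

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

Youwang Deng

Affiliation: EpistemicaLab (Independent Research)

AI summary: Proposes a diagnostic framework that factorises LLM user-side memory into three orthogonal axes (behavioral consistency, factual presence, and factual absence), finds that parametric and retrieval-based memory substrates are asymmetric across these axes, and shows that heavier RLHF tuning strengthens the asymmetry.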

Comments Preprint. Code: https://github.com/EpistemicaLab/substrate-asymmetry-memory

Abstract

User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

2606.11711 2026-06-11 cs.LG stat.ML New submission

Improving Human Diving Endurance with a Field-Deployable Untethered Exoskeleton

Zhihao Zhou, Zhenmeng Ju, Rui Yang, Chenxi Zhang, Zhihao Zhou, Ming Xu, Enhao Zheng, Dongjie Jiang, Lecheng Ruan, Jingeng Mai, Qining Wang

Affiliations: Institute for Artificial Intelligence, Peking University; Beijing Engineering Research Center of Intelligent Rehabilitation Engineering; School of Advanced Manufacturing and Robotics, Peking University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Department of Sports Medicine, Peking University Third Hospital; School of Rehabilitation Sciences and Engineering, University of Health and Rehabilitation Sciences

AI summary: Presents DiveMate, an exoskeleton whose adaptive kick assistance in real-world underwater environments increases travel distance per unit of breathing gas by 42.9%, extends dive duration by 54.9%, and lowers the net gas consumption rate by 47.0%, substantially improving human diving endurance.

English abstract

Human endurance in underwater locomotion is fundamentally restricted by high energetic demands to overcome drag and the finite supply of self-contained breathing gas. While exoskeleton technology can reduce the metabolic cost of humans in terrestrial locomotion, its potential to enhance human endurance during underwater diving remains entirely unexplored. Here, we present DiveMate, a field-deployable, untethered exoskeleton designed to improve human diving endurance via adaptive kick assistance in real-world underwater environments. During naturalistic diving, DiveMate increases the travel distance using a given energy (breathing gas) by 42.9% and extends dive duration by 54.9% through reducing gas consumption rate. Marked reductions in muscle activation indicate a decrease in physiological exertion, with the net gas consumption rate decreasing by 47.0%. Kinematic characteristics and regularity improvements further underpin efficient energy economy. These results suggest that applying exoskeleton assistance is beneficial for improving human diving endurance and augmenting their ability to explore the aquatic world. This study extends the application frontier of exoskeletons and provides a potential reference for the design and assessment of future underwater assistive devices.

2606.11702 2026-06-11 cs.CV cs.AI cs.CL New submission

MedCTA: A Benchmark for Clinical Tool Agents

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

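Affiliations: King Abdullah University of Science and Technology (KAUST); Massachusetts Institute of Technology (MIT)

AI summary: Introduces MedCTA, a benchmark built on real multimodal clinical inputs (radiology images, pathology slides, and reports) that evaluates how well medical AI agents plan and execute tool retrieval, evidence acquisition, and integration.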

Comments Project Page: https://ivul-kaust.github.io/MedCTA/ Code: https://github.com/IVUL-KAUST/MedCTA Data: https://huggingface.co/datasets/IVUL-KAUST/MedCTA

English abstract

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/
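To make the idea of process-aware evaluation concrete, the sketch below scores a predicted tool-call trajectory against a gold one along three of the axes the abstract names: tool selection, argument validity, and trajectory fidelity. The metric definitions here are illustrative assumptions, not MedCTA's actual scoring rules, and the tool names and argument schemas are hypothetical.

```python
def tool_selection_accuracy(pred, gold):
    """Fraction of gold steps whose tool name is matched at the same position."""
    if not gold:
        return 1.0
    hits = sum(1 for (p, _), (g, _) in zip(pred, gold) if p == g)
    return hits / len(gold)

def argument_validity(pred, schemas):
    """Fraction of predicted calls whose argument names match the tool's schema."""
    if not pred:
        return 1.0
    valid = sum(1 for tool, args in pred if set(args) == schemas.get(tool))
    return valid / len(pred)

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def trajectory_fidelity(pred, gold):
    """Order-preserving overlap of tool names, normalised by the longer trajectory."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return lcs_length([t for t, _ in pred], [t for t, _ in gold]) / longest

# Hypothetical tools and schemas, for illustration only.
schemas = {"segment": {"image"}, "measure": {"mask"}, "report": {"findings"}}
gold = [("segment", {"image": "ct_001"}), ("measure", {"mask": "m1"}), ("report", {"findings": "f1"})]
# The agent skips the measurement step and calls the last tool with a wrong argument name.
pred = [("segment", {"image": "ct_001"}), ("report", {"text": "f1"})]

print(tool_selection_accuracy(pred, gold), argument_validity(pred, schemas), trajectory_fidelity(pred, gold))
```

Scoring the steps separately is what lets a skipped step or a malformed call show up as a distinct failure, instead of being folded into a single final-answer score.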

2606.11699 2026-06-11 cs.LG New submission

A Data-Centric Framework for Detecting and Correcting Corrupted Labels

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

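Affiliations: Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam

AI summary: Proposes Relabeler, a framework that detects corrupted labels by jointly using local and global relationships and corrects them by estimating the most probable clean label from the input features and the noisy label, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance across multiple datasets.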

English abstract

The performance of machine learning and deep learning models largely depends on the quality of the training data. However, the quality of the real-world datasets is often compromised by noisy labels, which can substantially degrade model accuracy and reliability. To address this challenge, we propose Relabeler, an end-to-end data-centric framework for detecting and correcting corrupted labels. For corrupted label detection, Relabeler jointly leverages both local and global relationships among data instances to identify potentially noisy samples. After detecting suspicious instances, Relabeler further performs label correction by estimating the most probable clean label for each instance based on both its input features and observed noisy label. Extensive experiments across multiple datasets, noise types, and noise rates demonstrate that Relabeler consistently outperforms state-of-the-art baselines, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance.
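The abstract does not say how Relabeler defines local and global relationships. As a rough illustration of the general detect-then-correct pattern, the sketch below swaps in two simple stand-ins: a k-nearest-neighbour vote as the local signal and a nearest-class-centroid check as the global signal. A label is flagged, and replaced, only when both signals agree with each other and disagree with the observed label. The function names and the toy data are assumptions made for this example, not the paper's method.

```python
import math
from collections import Counter

def knn_vote(i, X, y, k):
    """Local signal: majority label among the k nearest neighbours of point i."""
    others = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return Counter(y[j] for j in others[:k]).most_common(1)[0][0]

def nearest_centroid(i, X, y):
    """Global signal: label of the closest class centroid, computed without point i."""
    sums, counts = {}, Counter()
    for j, (x, label) in enumerate(zip(X, y)):
        if j == i:
            continue
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[label] += 1
    return min(sums, key=lambda lab: math.dist(X[i], [v / counts[lab] for v in sums[lab]]))

def detect_and_correct(X, y, k=3):
    """Flag labels contradicted by both signals and replace them with the agreed label."""
    flagged, corrected = [], list(y)
    for i in range(len(X)):
        local, global_ = knn_vote(i, X, y, k), nearest_centroid(i, X, y)
        if local == global_ and local != y[i]:
            flagged.append(i)
            corrected[i] = local
    return flagged, corrected

# Toy data: two well-separated clusters; the label of point 3 has been flipped.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
noisy_y = [0, 0, 0, 1, 1, 1, 1, 1]
flagged, corrected = detect_and_correct(X, noisy_y)
print(flagged, corrected)
```

Requiring both signals to agree is a conservative choice for this toy version: it lowers the chance of overwriting a correct label near a class boundary, at the cost of missing some noisy ones.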

2606.11695 2026-06-11 cs.LG cs.AI New submission

Parameter-Efficient Adapter Fine-Tuning for Tabular-Image Multimodal Learning

Jiaqi Luo

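Affiliations: School of Mathematical Sciences, Soochow University

AI summary: Proposes TI-Adapter, a framework for efficient multimodal fine-tuning that freezes the tabular encoder and adds an adapter after it, and adapts the image branch with embedding-level and bottleneck-level adapters; on 20 datasets it matches or exceeds full fine-tuning with fewer trainable parameters.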

English abstract

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.
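The abstract does not specify the internal design of TI-Adapter's modules. The sketch below shows the commonly used bottleneck adapter pattern (down-projection, nonlinearity, up-projection, residual connection) applied to a frozen encoder's embedding, as one plausible reading. The class name, the sizes, and the zero initialisation of the up-projection are assumptions for illustration, and biases are omitted for brevity.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

class BottleneckAdapter:
    """Residual adapter: h + up(relu(down(h))). Only down/up would be trained."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        # Zero-initialised up-projection: the adapter starts as the identity,
        # so the frozen encoder's embedding is unchanged before training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        z = relu(matvec(self.down, h))
        delta = matvec(self.up, z)
        return [a + b for a, b in zip(h, delta)]

    def num_params(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])

# Hypothetical sizes: an 8-dimensional embedding with a bottleneck of width 2.
adapter = BottleneckAdapter(dim=8, bottleneck=2)
embedding = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 3.0]
print(adapter(embedding) == embedding, adapter.num_params())
```

With dimension d and bottleneck width r the adapter has 2·d·r weights, against d² for a single dense d×d layer, which is where the parameter savings of this pattern come from when r is much smaller than d.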

2606.11680 2026-06-11 cs.AI cs.CL cs.LG New submission

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

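Affiliations: Duke University; Snowflake AI Research

AI summary: Proposes HORMA, a framework that builds a file-system-like hierarchical memory and retrieves from it with a lightweight navigation agent trained by reinforcement learning, improving performance on long-horizon tasks while reducing token consumption.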

English abstract

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.
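The two-stage design described in the abstract, a hierarchy of summaries linked to raw trajectories plus a navigator that walks it, can be pictured with a small data structure. In the sketch below a greedy keyword-overlap rule stands in for the reinforcement-learning-trained navigation agent, and whitespace-separated words stand in for tokens. The node layout, the function names, and the example memories are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MemoryNode:
    name: str
    summary: str
    children: List["MemoryNode"] = field(default_factory=list)
    raw_trajectory: Optional[str] = None  # leaves link back to the full trace

def overlap(query, text):
    """Number of lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def navigate(root, query, budget_tokens):
    """Descend greedily by summary overlap; return the path and the chosen context.

    The raw trajectory is returned when it fits the budget, otherwise the summary.
    """
    node, path = root, [root.name]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
        path.append(node.name)
    raw = node.raw_trajectory or ""
    context = raw if raw and len(raw.split()) <= budget_tokens else node.summary
    return path, context

# Hypothetical memory tree, laid out like directories and files.
memory = MemoryNode("memory", "all experience", [
    MemoryNode("kitchen_tasks", "kitchen tasks: heat and cool food", [
        MemoryNode("heat_egg", "heat egg in microwave",
                   raw_trajectory="open microwave; put egg; toggle microwave; take egg"),
        MemoryNode("cool_apple", "cool apple in fridge",
                   raw_trajectory="open fridge; put apple; close fridge; take apple"),
    ]),
    MemoryNode("conversation", "user preferences and past conversation"),
])

print(navigate(memory, "how to heat the egg", budget_tokens=20))
print(navigate(memory, "how to heat the egg", budget_tokens=3))
```

Because each summary keeps a link to its raw trajectory, the navigator can return full detail when the budget allows and fall back to the summary when it does not, which is how this layout avoids the information loss of pure compression.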

2606.11678 2026-06-11 cs.CL New submission

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Yijie Deng, He Zhu, Wen Wang, Junyou Su, Minxin Chen, Wenjia Zhang

Affiliations: School of Architecture and Urban Planning, Shenzhen University; Shenzhen Key Laboratory of Urban Spatial Information and Intelligent Modeling; Department of Urban Planning and Design, The University of Hong Kong

AI summary: Proposes UPBench, a framework that evaluates 25 LLMs on a 4×5 matrix of knowledge pillars and cognitive levels; models do better on analytical tasks than on factual recall and integrative judgment, revealing how strongly planning knowledge depends on institutional context.

English abstract

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.
