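arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.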

2606.13222 2026-06-12 cs.RO cs.AI New submission

Proprioceptive-visual correspondence enables self-other distinction in humanoid robots

Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang, Hongkai Xiong, Wenjun Zeng, Wentao Zhu

Affiliations: Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University; Peking University; Carnegie Mellon University; East China Normal University; Ningbo Institute of Digital Twin

AI summary: Proposes learning self-other distinction from proprioceptive-visual correspondence, without identity labels or kinematic models, and building a predictive self-model that supports target reaching, collision-aware motion planning, and motion retargeting.

Comments 23 pages, 9 figures, 1 supplementary table

Abstract

Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron-zc.github.io/humanoid-self-model/.
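A toy illustration of the correspondence idea, not the authors' learned model: if the robot's own joint speeds co-vary with the visual motion of exactly one observed body, that body is a candidate for self. The function names, the use of Pearson correlation, and the sample signals are all assumptions made for this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def identify_self(joint_speed, visual_motion_by_agent):
    """Pick the observed agent whose visual motion best tracks the robot's joint speed."""
    return max(
        visual_motion_by_agent,
        key=lambda name: pearson(joint_speed, visual_motion_by_agent[name]),
    )


# Hypothetical signals: proprioceptive joint speed and per-agent image motion.
joint_speed = [0.0, 0.5, 1.0, 0.5, 0.0]
observed = {
    "agent_a": [0.1, 0.6, 1.1, 0.6, 0.1],  # moves when the robot moves
    "agent_b": [1.0, 0.5, 0.0, 0.5, 1.0],  # moves on its own schedule
}
self_candidate = identify_self(joint_speed, observed)
```

Here `self_candidate` is `"agent_a"`. A hand-written correlation like this would likely fail when another agent mirrors the robot, which is one reason a learned correspondence is attractive.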

2606.13220 2026-06-12 cs.AI cs.CE cs.ET cs.LG cs.MA New submission

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Fabrizio Marozzo, Pietro Liò

Affiliations: University of Calabria; University of Cambridge

AI summary: Proposes LLM-as-an-Investigator, an evidence-first AI methodology that estimates problem ambiguity, generates hypotheses, asks clarifying questions, and updates probabilities, avoiding premature acceptance of user assumptions and improving diagnostic accuracy.

Abstract

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
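The update-then-decide loop described in the abstract can be sketched with a plain Bayes update. This is a minimal sketch under stated assumptions: the abstract does not say how probabilities are updated or when the agent stops, so the Bayes rule, the `margin` stopping criterion, the hypothesis names, and the likelihood values are all hypothetical.

```python
def update_posterior(prior, likelihoods):
    """Bayes update; both arguments are dicts keyed by hypothesis name."""
    unnormalised = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("the answer is impossible under every hypothesis")
    return {h: p / total for h, p in unnormalised.items()}


def leading_hypothesis(posterior, margin=0.3):
    """Return the top hypothesis if it beats the runner-up by `margin`, else None."""
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None


# Hypothetical hydraulic fault: the user suspects the filter, but evidence is thin.
prior = {"clogged filter": 0.4, "worn pump": 0.4, "air in line": 0.2}
# Assumed likelihood of the answer "pressure drops only under load" per hypothesis.
answer_likelihood = {"clogged filter": 0.2, "worn pump": 0.8, "air in line": 0.3}

posterior = update_posterior(prior, answer_likelihood)
decision_before = leading_hypothesis(prior)      # None: keep asking questions
decision_after = leading_hypothesis(posterior)   # "worn pump"
```

Before the clarification question no hypothesis leads, so the agent keeps investigating. After one informative answer the evidence separates the candidates.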

2606.13218 2026-06-12 cs.CL New submission

When Similar Means Different: Evaluating LLMs on Arabic–Hebrew Cognates

Junhong Liang, Noor Abo Mokh, Bashar Alhafni

Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Builds the SemCog Bench benchmark (1,858 word pairs) covering Arabic–Hebrew cognates, false friends, and loanwords to evaluate cross-lingual semantic understanding in LLMs; finds that models rely on surface-form similarity, perform poorly on false friends and loanwords, and gain little from context.

Abstract

Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic–Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form–meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

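2606.13216 2026-06-12 cs.CL cs.LG New submission

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk

Affiliations: Fairseq; AggreFact

AI summary: Analyzes cross-attention distributions with optimal transport; finds that hallucination detection is concentrated in the first four decoder layers, and that the method works when the failure is source disengagement but cannot detect unfaithful summaries whose errors arise downstream of attention.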

Comments Accepted to ICML Mechanistic Interpretability Workshop 2026

Abstract

Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1–L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum, above chance but substantially below the supervised MiniCheck-Flan-T5-L ($69.9\%$/$74.3\%$). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
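To make the "distance from a reference distribution" idea concrete, here is a minimal sketch of a Wasserstein-to-uniform score for a single attention vector. It assumes unit ground cost between adjacent source positions, so that the 1-D Wasserstein-1 distance reduces to a sum of absolute CDF differences. The paper's actual cost function and aggregation over heads and layers may differ, and the function names are hypothetical.

```python
from itertools import accumulate


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on the same ordered positions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def wass_to_unif(attention):
    """Distance of one cross-attention distribution from the uniform distribution."""
    n = len(attention)
    return wasserstein_1d(attention, [1.0 / n] * n)


# All attention mass on the first source token versus evenly spread attention.
peaked_score = wass_to_unif([1.0, 0.0, 0.0, 0.0])
spread_score = wass_to_unif([0.25, 0.25, 0.25, 0.25])
```

The score only measures how concentrated the attention is. As the abstract notes, a summary can attend to the right tokens and still misstate them, which a score of this kind cannot see.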

2606.13211 2026-06-12 cs.AI New submission

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

Omar Alshahrani, Muzammil Behzad

Affiliations: King Fahd University of Petroleum & Minerals, Saudi Arabia; SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Saudi Arabia

AI summary: Proposes a cross-modality analytical framework that unifies hallucination taxonomy, detection, and mitigation strategies across five imaging modalities; finds that general-purpose foundation models outperform medical-specialized models on hallucination benchmarks, and maps the findings to the FDA's Total Product Lifecycle oversight.

Abstract

AI systems are being deployed across medical imaging faster than their failure modes are understood. Currently, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences for biopsy decisions, staging, and treatment planning, among others. This structured narrative review synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions: (1) how can existing taxonomies be unified across modalities? (2) do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

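2606.13209 2026-06-12 cs.LG cs.CL New submission

Understanding helpfulness and harmless tension in reward models

Eshaan Tanwar, Pepa Atanasova

Affiliations: University of Copenhagen

AI summary: Using activation analysis and ablation experiments, finds that helpfulness and harmlessness objectives interfere in reward models and that shared neurons have a disproportionate influence on model behaviour, contributing to alignment tension.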

Comments The source code used in this study is publicly available at: https://github.com/EshaanT/RM-alignment_tension

Abstract

Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

2606.13206 2026-06-12 cs.CV cs.RO New submission

Visual Place Recognition in Forests with Depth-Aware Distillation

Walter Nedov, Saimunur Rahman, Kavindie Katuwandeniya, David Hall, Kaushik Roy, Peyman Moghadam

Affiliations: CSIRO Robotics, Brisbane, Australia; University of Queensland, Brisbane, Australia; Queensland University of Technology, Brisbane, Australia

AI summary: Visual place recognition in forests is hard because of repetitive vegetation, weak structural cues, and large appearance changes; proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2 model and improves robustness on the WildCross benchmark.

Comments IEEE ICRA Workshop on Field Robotics 2026

2606.13203 2026-06-12 cs.RO New submission

Embedding ISO 10218 Safety Compliance in Robots via Control Barrier Functions for Human-Robot Collaboration

Federico Parma, Cesare Tonola, Nicola Pedrocchi, Manuel Beschi

Affiliations: Dept. of Electrical and Information Engineering, Polytechnic of Bari; Dipartimento di Ingegneria Meccanica e Industriale, University of Brescia; Institute of Intelligent Industrial Technologies and Systems, National Research Council of Italy, STIIMA-CNR

AI summary: Proposes a control barrier function (CBF) approach that uses human acceleration data to predict the minimum human-robot distance and enforces the safety constraint within a sequential quadratic programming (SQP) framework; validated on a UR10e, where Method II reduces mean trajectory error by 63% relative to Method I while complying with ISO 10218.

Abstract

Human-Robot Collaboration (HRC) requires strict adherence to safety standards, such as ISO 10218, to prevent harmful interactions. Standard Speed and Separation Monitoring (SSM) filters calculate safe robotic speeds based on conservative assumptions, such as constant human velocity, which prevents accurate predictions of minimum separation distances and causes unnecessary operational halts. This paper proposes a Control Barrier Function (CBF) that explicitly incorporates human acceleration data to analytically forward-predict the minimum human-robot separation distance during a worst-case robotic stopping trajectory. To guarantee safety at the control level, this predictive CBF is integrated as an inequality constraint within a Sequential Quadratic Programming (SQP) framework. Specifically, two methods are proposed: Method I, a CBF-constrained PD safety filter; and Method II, a task-scaling SQP controller that enforces a spatial tube constraint. Simulated and real-world experiments on a UR10e robot evaluate the two proposed methods against a standard industrial SSM module baseline. Results demonstrate that Method II dynamically modulates execution speed and confines spatial deviations. Compared to Method I, Method II achieves a 63% reduction in mean trajectory error and avoids excessive evasive manoeuvres, ensuring high task throughput while complying with ISO 10218 SSM guidelines.
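A one-dimensional sketch of forward-predicting the minimum separation while the robot brakes to rest, with the human allowed to accelerate. This is an illustration under simplifying assumptions, not the paper's formulation: motion is along one axis, both agents approach each other, the robot brakes at a constant rate, and all names and numbers are hypothetical.

```python
def min_separation(d0, v_robot, a_brake, v_human, a_human):
    """Smallest 1-D gap while the robot brakes from v_robot to rest.

    d0 is the initial gap; speeds and accelerations are measured toward the
    other agent. The gap is gap(t) = d0 - (v_robot + v_human) * t
    + 0.5 * (a_brake - a_human) * t**2 on the stopping horizon [0, t_stop].
    """
    if a_brake <= 0:
        raise ValueError("a_brake must be positive")
    t_stop = v_robot / a_brake  # time for the robot to come to rest

    def gap(t):
        robot_travel = v_robot * t - 0.5 * a_brake * t * t
        human_travel = v_human * t + 0.5 * a_human * t * t
        return d0 - robot_travel - human_travel

    candidates = [0.0, t_stop]
    curvature = a_brake - a_human
    if curvature > 0:  # convex gap: an interior minimum may exist
        t_crit = (v_robot + v_human) / curvature
        if 0.0 < t_crit < t_stop:
            candidates.append(t_crit)
    return min(gap(t) for t in candidates)


def barrier(d_min, d_safe):
    """Barrier value h; the safe set is h >= 0."""
    return d_min - d_safe


# Constant-velocity human versus an accelerating human, same initial state.
d_const = min_separation(2.0, 1.0, 2.0, 0.5, 0.0)
d_accel = min_separation(2.0, 1.0, 2.0, 0.5, 1.0)
```

With these numbers the accelerating human shrinks the predicted minimum gap from 1.5 m to 1.375 m, which shows why a constant-velocity assumption can mispredict the separation in either direction. In a controller, a value like `barrier(d_min, d_safe)` would enter the optimisation as an inequality constraint.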

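2606.13201 2026-06-12 cs.AI New submission

A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice

Manisha Dubey, Anirban Sarkar, Subramanian Ramamoorthy

Affiliations: School of Informatics, University of Edinburgh, UK; Cold Spring Harbor Laboratory, USA

AI summary: Proposes a bounded trade-off reasoning framework that introduces a trade-off tolerance parameter to model a screening process, producing preference patterns that differ from standard utility models and explaining context-dependent behaviour in multi-attribute choice.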

Comments 3 pages, 1 figure, accepted as extended abstract at Annual Conference on Cognitive Computational Neuroscience 2026

2606.13197 2026-06-12 cs.AI New submission

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

Fuqiang Niu, Bowen Zhang

Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; School of Artificial Intelligence, Shenzhen Technology University

AI summary: Proposes the ARMOR-MAD framework, which adaptively controls heterogeneous multi-agent debate through pre-debate agreement routing, an early agreement stopping evaluator, and semantic outlier detection, improving reasoning accuracy and efficiency.

Abstract

Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
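The three control points named in the abstract can be sketched as a short loop. This is a rough sketch, not the ARMOR-MAD implementation: agents are modelled as hypothetical callables `agent(question, peer_answers)`, agreement is exact string match rather than a semantic comparison, and the outlier rule (half weight for answers given by a single agent) is an assumption.

```python
from collections import Counter


def agreement(answers):
    """Fraction of agents that gave the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def weighted_vote(answers, outlier_weight=0.5):
    """Aggregate answers, down-weighting answers that only one agent gave."""
    counts = Counter(answers)
    weights = Counter()
    for answer in answers:
        is_outlier = counts[answer] == 1 and len(counts) > 1
        weights[answer] += outlier_weight if is_outlier else 1.0
    return max(weights, key=weights.get)


def run_debate(agents, question, max_rounds=3, threshold=1.0):
    """Return (final_answer, debate_rounds_used)."""
    # Round 0: independent answers, no peer context.
    answers = [agent(question, None) for agent in agents]
    rounds = 0
    # The first check plays the routing role (skip debate if agents agree);
    # later checks play the early-stopping role (stop once answers converge).
    while agreement(answers) < threshold and rounds < max_rounds:
        answers = [agent(question, answers) for agent in agents]
        rounds += 1
    return weighted_vote(answers), rounds
```

When all Round-0 answers already agree, `run_debate` returns with zero debate rounds, which is where the compute saving over a fixed-round pipeline would come from.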

2606.13194 2026-06-12 cs.LG New submission

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger

Affiliations: Karlsruhe Institute of Technology; IPAI Foundation gGmbH

AI summary: To address the comparability crisis in wearable human activity recognition, builds a large-scale benchmark of 30 datasets and evaluates 17 architectures; finds that predictive performance is saturating, while compact models and Random Forests form the Pareto frontier for deployment efficiency.

Comments 20 pages, 9 Figures, 3 Tables

Abstract

Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
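The Pareto frontier mentioned in the abstract can be computed with a simple dominance check. The sketch below uses two criteria only (macro-F1 and latency), and the model names and numbers are made up for illustration; they are not results from the benchmark.

```python
def pareto_front(models):
    """Return names of models not dominated on (higher macro-F1, lower latency).

    `models` is a list of (name, macro_f1, latency_ms) tuples. A model is
    dominated if another is at least as good on both criteria and strictly
    better on at least one.
    """
    front = []
    for name, f1, latency in models:
        dominated = any(
            other_f1 >= f1
            and other_latency <= latency
            and (other_f1 > f1 or other_latency < latency)
            for _, other_f1, other_latency in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical entries: equal accuracy at ten times the latency is dominated.
candidates = [
    ("compact_net", 0.80, 5.0),
    ("big_hybrid", 0.80, 50.0),
    ("forest", 0.75, 1.0),
]
front = pareto_front(candidates)
```

Here `front` is `["compact_net", "forest"]`: the large model matches the compact one on accuracy but costs more, so it drops off the frontier. The benchmark also reports peak memory and model size, which would extend the dominance check to more criteria.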

2606.13192 2026-06-12 cs.AI New submission

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng, Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao

Affiliations: Ant Group

AI summary: Proposes the UXBench benchmark (2,000 VQA samples) to evaluate multimodal LLMs on UI-based reasoning, and designs the UI-UX model, which uses a reward routing mechanism and an asymmetric transition reward to reach 0.7963 accuracy on UXBench, surpassing Claude-4.5-Sonnet.

Comments 10 pages, 6 figures, Accepted at CVPR 2026 Findings

Abstract

User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). The application of multimodal large language models (MLLMs) to user interfaces is evolving rapidly, in areas such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research on evaluating UX from UI screenshots is still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 (surpassing Claude-4.5-Sonnet's 0.6550) while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.

2606.13191 2026-06-12 cs.LG New submission

The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

Ryosuke Sakamoto, Kotaro Sakamoto

Affiliations: Institute for the Advanced Study of Human Biology, Institute for Advanced Study, Kyoto University; Graduate School of Engineering, The University of Tokyo

AI summary: Gives a geometric account of phase-transition-like behaviour in generative dynamics via projection caustics, and proposes the Critical Boundary Detector (CBD) to diagnose score-direction instability, localise mode commitment, and support control in sensitive regions.

Abstract

Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.
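The notion of a nearest-point projection that "ceases to be unique" can be seen in the smallest possible example: a data support of two points on a line. The sketch below is only that toy case and says nothing about how CBD is computed; the function name and tolerance are assumptions.

```python
def nearest_points(x, support, tol=1e-12):
    """Return every support point whose distance to x is within tol of the minimum."""
    distances = [abs(x - s) for s in support]
    d_min = min(distances)
    return [s for s, d in zip(support, distances) if d - d_min <= tol]


support = [-1.0, 1.0]

left = nearest_points(-1e-6, support)    # just left of the midpoint
right = nearest_points(1e-6, support)    # just right of the midpoint
on_caustic = nearest_points(0.0, support)
```

A perturbation of size 2e-6 flips the projection from -1 to +1, and exactly at the midpoint both points are nearest. That midpoint is the analogue of the caustic in this toy setting: a location where a tiny change in state selects a different mode.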

2606.13190 2026-06-12 cs.RO cs.HC New submission

Multi-Modal Multi-Agent Robotic Cognitive Alignment enabled by Non-Invasive Consumer Brain Computer Interfaces: A Proof of Concept Exploration

Nataliya Kosmyna, Liz Jenkins, Anoop K. Sinha

Affiliations: Google; Paradigms of Intelligence; Cambridge, MA, United States; Mountain View, CA, United States

AI summary: Proposes a framework that uses a consumer-grade brain-computer interface to monitor EEG signals and defer agent communications during high cognitive load, enabling cognitively aligned multi-agent interaction; preliminary results show the feasibility of combining real-time signal processing, large language models, and robots.

Comments 19 pages, 9 figures, for associated video, see https://youtu.be/0Tav-G87XGs

Abstract

While non-verbal behaviors and expressive movements are essential for natural human-robot interaction, existing methods often overlook a crucial element: the human's internal cognitive state. Frequently, proactive multi-agent systems can interrupt humans at inopportune moments, leading to cognitive overload and decreased task performance. This paper introduces a framework for generating "cognitively aligned" multi-agent interactions, enhancing the ability of robotic systems to contextually defer communications to the user of an agent system during moments of high human mental workload and engagement. We present the design and implementation of a closed-loop architecture that explores the interplay between autonomous task execution and real-time neurophysiological focus. Using a consumer-grade Brain-Computer Interface (BCI), our approach continuously monitors Electroencephalography (EEG) spectral band powers while a human performs an engagement-inducing task. We propose an engagement-driven pipeline where an HTTP-based signaling mechanism places a primary agent's sensory inputs and audio outputs into a holding state upon detecting high engagement. This allows secondary agents to seamlessly process complex, delegated tasks in the background. Once the human's cognitive state returns to a lower cognitive load baseline, the primary agent releases the queued agent message. Our preliminary results demonstrate the feasibility of leveraging real-time signal processing, Large Language Models (LLMs), and physical robotic embodiments to create cognitively-aware, non-intrusive multi-agent systems.
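The hold-and-release behaviour described above can be sketched as a small gate in front of a message queue. Several things here are assumptions rather than details from the abstract: the engagement index beta / (alpha + theta) is a commonly used EEG ratio, but the paper's metric is not stated; the threshold and band-power values are arbitrary; and the HTTP signalling layer is left out entirely.

```python
from collections import deque


def engagement_index(theta, alpha, beta):
    """Engagement ratio from EEG band powers (assumed metric: beta / (alpha + theta))."""
    return beta / (alpha + theta)


class DeferredNotifier:
    """Holds agent messages while engagement is high, releases them when it drops."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queue = deque()

    def submit(self, message, engagement):
        """Queue a message, then return whatever may be delivered right now."""
        self.queue.append(message)
        return self.poll(engagement)

    def poll(self, engagement):
        """Return queued messages in order if engagement is below the threshold."""
        if engagement >= self.threshold:
            return []
        released = list(self.queue)
        self.queue.clear()
        return released


notifier = DeferredNotifier(threshold=0.6)
# User is focused: the secondary agent's result is held back.
held = notifier.submit("report ready", engagement_index(4.0, 4.0, 8.0))
# Engagement drops back toward baseline: the queued message is released.
delivered = notifier.poll(engagement_index(6.0, 6.0, 3.0))
```

A real system would likely smooth the index over a time window before thresholding, since raw band powers from consumer headsets are noisy.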

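2606.13189 2026-06-12 cs.CL New submission

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Fuqiang Niu, Bowen Zhang

Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; School of Artificial Intelligence, Shenzhen Technology University

AI summary: Proposes the SICI index, which diagnoses stance detection difficulty along seven dimensions of semantic-pragmatic complexity; shows that LLM errors shift regime from over-attribution to concentrated abstention as complexity rises, and that interventions only move models along the attribution–abstention axis rather than removing the bottleneck.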

Abstract

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.

2606.13188 2026-06-12 cs.CV cs.AI 新提交

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Transformer引导的图注意力直接心脏网格重建：一种结构数字孪生框架

Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

发表机构 * CAVE Labs, C-IoT, Dept. of CSE, PES University（PES大学计算机科学与工程系C-IoT实验室CAVE实验室）； C-IoT, Dept. of CSE, PES University（PES大学计算机科学与工程系C-IoT实验室）

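AI summary: Proposes an end-to-end network combining a 3D Swin Transformer with a GAT head that generates smooth cardiac surface meshes directly from medical images, avoiding conventional post-processing and reaching a mean Chamfer distance of 1.8 mm on MM-WHS 2017.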

AI中文摘要

构建患者特异性心脏模型是精准心脏病学的核心，但这些模型在临床应用中始终面临同一障碍：网格生成缓慢、混乱且令人沮丧。标准工作流程——分割图像、运行Marching Cubes、然后手动清理结果——耗时、操作者间不一致，并且需要大多数临床团队不具备的专业知识。我们采取了一种根本不同的方法。我们不将分割和网格生成视为两个独立问题，而是训练一个单一的端到端网络，直接从原始3D医学图像生成平滑、可用于模拟的心脏表面网格。核心是一个3D Swin Transformer编码器-解码器，从CT或MRI体积中提取体积特征，配以一个图注意力网络（GAT）头，迭代变形模板网格以拟合患者心脏边界。我们在MM-WHS 2017基准上使用CT和MRI进行了测试。分割分数具有竞争力（CT上Dice为0.84，MRI上为0.83），但主要关注点是网格质量：平均Chamfer距离为1.8 mm，95%分位数表面距离低于5 mm。每个网格通过单次前向传播生成——无需Marching Cubes、平滑滤波器或手动清理。我们认为，对于心脏数字孪生管道，几何保真度和拓扑正确性比像素级Dice分数更重要。通过消除后处理瓶颈，该方法使患者特异性心脏模拟在临床使用中变得更加可行。

英文摘要

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
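
For readers unfamiliar with the headline mesh metric, the following is a minimal sketch of a symmetric Chamfer distance between two point sets sampled from a predicted and a reference surface. Definitions vary across papers (sum versus mean, squared versus unsquared distances); the averaging convention used here is an assumption, not necessarily the one the authors used, and the toy points are hypothetical.

```python
import math


def chamfer_distance(a, b):
    """Symmetric mean Chamfer distance between two non-empty lists of 3D points."""

    def mean_nearest(src, dst):
        # Average, over points in src, of the distance to the closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    # Assumed convention: average the two directed terms.
    return 0.5 * (mean_nearest(a, b) + mean_nearest(b, a))


# Hypothetical toy data: the reference is the prediction shifted by 1 unit along z.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, ref))  # prints 1.0
```

This brute-force version is quadratic in the number of points; for dense surface samples a spatial index such as a k-d tree would normally replace the inner `min`.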

2606.13187 2026-06-12 cs.CL 新提交

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

Reddit生物伦理争议中立场检测的上下文感知数据集

Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China（中国科学技术大学网络空间安全学院）； School of Artificial Intelligence, Shenzhen Technology University（深圳技术大学人工智能学院）； School of Urban Planning and Design, Peking University（北京大学城市规划与设计学院）

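AI summary: Presents BioStance, a dataset of 39,600 annotated post-comment pairs from Reddit bioethical discussions covering six controversial targets, labeled by three independent annotators with a three-class stance scheme at high reliability, to support research on context-aware stance detection.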

AI中文摘要

生物伦理辩论越来越多地在社交媒体上展开，然而立场检测研究缺乏用于建模此类上下文依赖话语的大规模、领域特定资源。我们提出了BioStance，一个上下文感知的数据集，包含来自Reddit生物伦理讨论的39,600个带注释的帖子-评论对。BioStance涵盖了生物伦理争议三个维度上的六个有争议的目标：基本价值冲突、个人自由与集体责任，以及技术不确定性。每个实例保留了层次化的对话上下文，并由三位独立注释者使用三类立场方案进行标注：赞成、反对和无立场。注释的平均Krippendorff's α为0.82，表明可靠性较高。通过结合主题多样性、对话结构和高质量的人工注释，BioStance支持上下文感知的立场检测、论据挖掘和生物伦理话语的计算分析研究。

英文摘要

Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff's $\alpha$ of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.
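
The reliability figure above is a Krippendorff's alpha. As a rough illustration of how that statistic is computed for nominal labels such as Favor, Against, and None, here is a minimal sketch based on the standard coincidence-matrix formulation. It is not the authors' evaluation code, the function name is our own, and the example ratings are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list of labels per item, with missing ratings simply omitted."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating carries no agreement information
        # Every ordered pair of ratings on the same item contributes 1 / (m - 1).
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    # Undefined (division by zero) when only one label is ever used.
    return 1.0 - observed / expected


# Hypothetical ratings from three annotators on three items.
ratings = [
    ["Favor", "Favor", "Favor"],
    ["Against", "Against", "Against"],
    ["None", "None", "Favor"],
]
print(krippendorff_alpha_nominal(ratings))
```

Perfect agreement yields 1.0, and values at or below 0 indicate agreement no better than chance.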

2606.13184 2026-06-12 cs.CL 新提交

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

LAUKIN：一个多司法管辖区的普通法合同数据集

Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong

发表机构 * Computer Science and Engineering, UNSW, Sydney Australia（新南威尔士大学计算机科学与工程学院）； Law and Justice, UNSW, Sydney Australia（新南威尔士大学法律与司法学院）

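AI summary: To address the need for cross-jurisdictional contract review, builds LAUKIN, a dataset of clause pairs from Australia, the UK, and India, classified for legal equivalence through multi-stage retrieval and expert annotation; benchmark results show that cross-jurisdictional classification is challenging.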

Comments 5 pages, 2 figures, 4 tables

AI中文摘要

跨国公司越来越需要跨司法管辖区的合同审查，但现有的法律NLP数据集大多局限于单一司法管辖区。我们引入了LAUKIN（澳大利亚、英国和印度的法律等价数据集），这是一个条款对（AU-UK、UK-IN、IN-AU）数据集，标注了布尔法律等价性。我们开发了一种新颖的多阶段检索和重排序流水线来构建初始条款对映射，随后由法律专家对部分条款对进行等价或不等价的标注。该数据集包含来自8种协议类型的204份合同的14,727个条款对，其中3,000个是手动标注的：900个训练集、600个开发集和1,500个测试集。我们评估了4种技术下的12个模型，最佳宏F1达到65.11%，使LAUKIN成为一个具有挑战性的基准。结果表明，尽管有共同的法律传统，但不同司法管辖区的起草惯例差异显著，使得跨司法管辖区的等价分类并非易事。LAUKIN还包括11,727个未标注的训练对，以支持未来法律NLP中的半监督学习研究。

英文摘要

Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.
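
The benchmark reports macro-F1, which averages per-class F1 scores with equal weight so that the minority class counts as much as the majority class. A minimal sketch of the metric follows; the label strings and predictions are hypothetical, not drawn from LAUKIN.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions for four clause pairs.
gold = ["Equivalent", "Equivalent", "Not Equivalent", "Not Equivalent"]
pred = ["Equivalent", "Not Equivalent", "Not Equivalent", "Not Equivalent"]
print(round(macro_f1(gold, pred), 4))  # prints 0.7333
```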

2606.13178 2026-06-12 cs.LG 新提交

Loss-Shift Transfer via Bayes Quotients

通过贝叶斯商进行损失转移迁移学习

Vasileios Sevetlidis

发表机构 * Athena Research Center（雅典娜研究中心）； Democritus University of Thrace（德谟克利特大学）； International Hellenic University（国际希腊大学）

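AI summary: Studies loss shift, where the data distribution is fixed but the loss function changes. Uses Bayes quotients to formalize a refinement order on losses, proves that a minimal representation for a coarser loss is insufficient for a strictly finer loss, and gives an exact quantitative identity for finite-output log loss.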

AI中文摘要

迁移学习通常被研究为分布偏移的结果。本文识别了一种正交的失败模式，其中数据分布固定而损失函数变化。这种设置称为\emph{损失转移}。损失决定了$X$中哪些信息是贝叶斯相关的，因此即使在同一联合分布$P(X,Y)$下，两个损失也可能需要不同的表示。该思想使用贝叶斯商形式化，允许按精炼程度对损失排序。在贝叶斯商公式中，严格精炼立即给出定性的障碍。对于较粗损失，源最小表示对于严格更细的目标损失是不充分的。对于有限输出的对数损失，这个障碍变成了精确的定量恒等式。超额风险是表示丢弃的关于$Y$的条件信息。在受控、学习、合成图像和真实图像设置中的实验显示了预测的效果，即分类等价的表示在固定数据分布下可能具有不同的最优对数损失性能。

英文摘要

Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called \emph{loss shift}. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.
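
The log-loss statement in the abstract can be unpacked with a standard information-theoretic identity. The notation below ($\phi$, $Z$, $R^\star$) is ours and is only a sketch of the kind of identity described, not the paper's exact formulation. For a representation $Z=\phi(X)$ and a finite label set, the best achievable log loss from $Z$ is the conditional entropy $H(Y\mid Z)$, so the excess over the Bayes risk is a conditional mutual information:

```latex
\begin{aligned}
R^\star(Z) &= \min_{q}\, \mathbb{E}\big[-\log q(Y \mid Z)\big] = H(Y \mid Z), \qquad Z = \phi(X),\\
R^\star(Z) - R^\star(X) &= H(Y \mid Z) - H(Y \mid X) = I(Y; X \mid Z) \;\ge\; 0 .
\end{aligned}
```

The second line uses $H(Y\mid X, Z)=H(Y\mid X)$, which holds because $Z$ is a function of $X$. The excess vanishes exactly when $Y$ is conditionally independent of $X$ given $Z$, so two representations that induce the same argmax classifier can still differ in log loss.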

2606.13177 2026-06-12 cs.CL cs.AI cs.LG 新提交

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

MemRefine: 基于LLM引导的压缩用于长期智能体记忆

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

发表机构 * Korea University（韩国大学）； KAIST（韩国科学技术院）

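AI summary: Proposes MemRefine, a framework in which an LLM judges factual content and compresses the memory store to a fixed budget through delete, merge, and preserve operations, maintaining downstream performance on multiple benchmarks and outperforming rule-based baselines.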

AI中文摘要

大型语言模型（LLM）智能体越来越需要在长期交互中运行，其中过去对话中的信息必须被保留和回忆以支持未来任务。然而，随着交互的积累，记忆存储无限制增长，并充满冗余条目，这些条目增加了存储成本，并通过排挤最有用的证据而降低了检索质量。此外，在具有硬性内存预算的资源受限平台上，这尤其受限，促使我们制定了有存储预算的记忆管理任务，即在固定预算内保持已构建的记忆库，同时保留对未来交互有用的信息。为此，我们提出了MemRefine，一个基于LLM引导的框架，由于表面相似性不能很好地反映事实价值，该框架仅使用相似性来提出候选对，并将删除、合并和保留决策推迟给基于事实内容的LLM判断，迭代直到满足预算。在多个记忆框架和长期对话基准上，MemRefine始终满足目标预算，同时保持下游性能，并在紧预算下优于基于规则的基线。

英文摘要

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. This growth is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
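
The propose-then-judge loop described in the abstract can be sketched as follows. Everything here is an assumption made for illustration: the budget is counted in entries, similarity is a token-level Jaccard score, and `toy_judge` is a hypothetical stand-in for the LLM call. None of these names or choices come from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-overlap similarity, used only to propose candidate pairs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def refine(memory, budget, judge):
    """Shrink memory toward `budget` entries; judge(a, b) returns (action, merged_text)."""
    memory = list(memory)
    preserved = set()  # pairs the judge decided to keep as separate entries
    while len(memory) > budget:
        candidates = [
            (jaccard(a, b), a, b)
            for a, b in combinations(memory, 2)
            if (a, b) not in preserved
        ]
        if not candidates:
            break  # this sketch stops early when no pair is left to examine
        _, a, b = max(candidates, key=lambda c: c[0])
        action, merged = judge(a, b)
        if action == "delete":
            memory.remove(b)
        elif action == "merge":
            memory.remove(a)
            memory.remove(b)
            memory.append(merged)
        else:  # "preserve"
            preserved.add((a, b))
    return memory


def toy_judge(a, b):
    # Hypothetical stand-in for an LLM judge: identical token sets count as redundant.
    if set(a.lower().split()) == set(b.lower().split()):
        return "delete", None
    return "preserve", None


store = ["User likes tea", "user likes tea", "User lives in Seoul"]
print(refine(store, budget=2, judge=toy_judge))
```

Keeping similarity as a cheap proposal step and leaving the decision to the judge mirrors the division of labor the abstract describes, though a real system would batch judge calls and measure the budget in tokens or bytes.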

2606.13176 2026-06-12 cs.AI 新提交

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

Mental-R1：面向心理健康评估的对齐LLM推理

Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

发表机构 * University of Oxford（牛津大学）； Oxford Suzhou Centre for Advanced Research（牛津大学苏州高等研究院）

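AI summary: Proposes Cognitive Relative Policy Optimization (CRPO), which aligns LLM reasoning with human cognitive processes through stage-dependent uncertainty modeling and stage-wise entropy regularization, improving weighted F1 by an average of 10.4 percentage points over the best reinforcement learning baseline on 8 mental health datasets.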

AI中文摘要

焦虑、抑郁和自杀等心理健康问题仍然是紧迫的全球挑战，及时准确的评估对于有效干预至关重要。最近，大型语言模型已被探索用于心理健康评估。然而，现有的通用后训练方法与人类评估的认知过程不一致，可能导致不可靠的推理结果。为弥合这一差距，我们提出了认知相对策略优化（CRPO），这是一个专为心理健康领域设计的强化学习框架。CRPO通过将阶段依赖的不确定性建模集成到策略优化过程中，扩展了组相对策略优化。具体来说，我们引入了一种阶段熵正则化机制，该机制在早期推理阶段鼓励广泛探索，并在后期阶段逐步强制执行自信决策，模仿人类从不确定性到确定性的认知转变。此外，受认知评价理论的启发，我们形式化了认知推理阶段，从而指导基于理论的可解释推理。在8个心理健康数据集上的实验表明，CRPO在加权F1分数上比最佳强化学习基线平均提高了10.4个百分点。此外，CRPO训练的模型Mental-R1在推理密集型案例上相比现有大型语言模型展现出明显优势，表明CRPO增强了心理健康评估的推理能力。

英文摘要

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.

2606.13174 2026-06-12 cs.LG cs.CL 新提交

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

与你合作得更好：将用户修正编译为编码代理的运行时强制

Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

发表机构 * University of Notre Dame（圣母大学）； IBM Research（IBM研究院）； Tencent AI Lab（腾讯AI实验室）

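AI summary: Proposes TRACE, which compiles user corrections into atomic rules enforced at runtime, substantially reducing preference violations by coding agents on later tasks compared with memory-only approaches.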

AI中文摘要

交互式LLM代理正成为日常工作的组成部分，但它们并不会随着时间的推移而变得更易于合作：在一个会话中记住的修正可能在下一个会话中仍被违反。我们研究了偏好访问与偏好遵从之间的差距。在源自匿名真实用户摩擦案例的任务中，Mem0记忆仍然导致57.5%的适用偏好检查被违反。我们引入了测试时规则获取与编译强制（TRACE），这是一个用于编码代理运行时的即插即用技能层管道，它挖掘用户修正，将其重写为原子规则，并编译为运行时检查，这些检查必须在代理完成未来任务之前通过。与开发者提前编写的运行时检查不同，TRACE技能来自用户自己的聊天修正。我们通过在ClawArena编码代理任务和MemoryArena衍生的内存密集型任务上进行模拟用户参与实验来评估TRACE。在ClawArena上，TRACE将分布内任务的保留偏好违反从100.0%降低到37.6%，将分布外任务从100.0%降低到2.0%。在MemoryArena衍生的任务上，TRACE将分布内违反从100.0%降低到60.5%，同时在任务通过率上匹配或超过最强的记忆基线。这些结果表明，将修正编译为运行时强制可以解决记忆单独无法可靠解决的重复摩擦失败模式，减少用户在未来会话中重复相同修正的需求。实验代码可在此https URL获取，可部署的技能可在此https URL获取。

英文摘要

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
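
To make the idea of compiling corrections into runtime checks concrete, here is a minimal sketch in which each atomic rule is a named predicate that an agent's draft output must satisfy before the task is marked complete. The `Rule` class, the rule names, and the string-matching checks are all hypothetical; the actual TRACE pipeline and its skill format are not shown in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # returns True when the output complies


def violated_rules(output, rules):
    """Names of the rules that the candidate output fails."""
    return [rule.name for rule in rules if not rule.check(output)]


# Hypothetical rules derived from past corrections such as
# "stop leaving print statements in" and "always annotate return types".
rules = [
    Rule("no_print_debugging", lambda out: "print(" not in out),
    Rule("has_return_annotation", lambda out: "->" in out),
]

draft = "def add(a, b):\n    print(a)\n    return a + b\n"
fixed = "def add(a: int, b: int) -> int:\n    return a + b\n"

print(violated_rules(draft, rules))  # prints ['no_print_debugging', 'has_return_annotation']
print(violated_rules(fixed, rules))  # prints []
```

The point of the gate is that compliance is enforced by code rather than left to the model recalling a stored preference, which is the distinction the abstract draws between preference access and preference compliance.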

2606.13172 2026-06-12 cs.LG 新提交

Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

检测学习表示中的解释不充分性：表示警觉性框架

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * Laboratory of Bioengineering and Nanosciences (LBN), University of Montpellier（蒙彼利埃大学生物工程与纳米科学实验室）； EuroMov Digital Health in Motion, University of Montpellier, IMT Mines Alès（蒙彼利埃大学EuroMov数字健康运动实验室，IMT阿莱斯矿业学院）； Certified Sophrologist, Sensorimotor Practice（认证心理放松治疗师，感觉运动实践）； Emeritus Professor, University of Montpellier（蒙彼利埃大学名誉教授）

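AI summary: Proposes VER, a conceptual framework that monitors the adequacy of learned representations by identifying persistent residual structures, complementing conventional evaluation methods.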

Comments 22 pages, 1 figure. Conceptual framework for representation diagnostics in machine learning

AI中文摘要

学习表示是现代机器学习的核心，通常通过预测性能、鲁棒性、不确定性估计或泛化能力来评估。然而，一个学习表示可能在操作上仍然成功，同时逐渐无法组织未被传统评估指标完全捕获的持久残差结构。本文介绍了VER（表示警觉评估器），一个用于监测学习表示充分性的概念框架。VER不提出新的学习算法、损失函数或模型架构。相反，它形式化了一个诊断过程，通过该过程可以识别、分析持久残差结构，并将其解释为解释不充分性的潜在指标。该框架将表示不充分性与普通预测误差、不确定性、噪声和分布偏移区分开来。它引入了一个基于表示识别、解释域界定、残差结构检测、解释阻力评估和警觉信号发出的监测序列。VER旨在作为机器学习中表示诊断的贡献。其目标不是取代现有的评估方法，而是通过将表示充分性视为明确的探究对象来补充它们。还概述了通过表示警觉性基准进行实证评估的路径。

英文摘要

Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

2606.13171 2026-06-12 cs.CL cs.AI 新提交

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

NTS-CoT: 基于思维链推理减轻大模型新闻时间线摘要中的幻觉

Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu

发表机构 * Central South University（中南大学）； Tsinghua University（清华大学）； Nanjing University（南京大学）； Suzhou Aerospace Information Research Institute（苏州空天信息研究院）； McGill University（麦吉尔大学）

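AI summary: Targets two types of hallucination in LLM-based news timeline summarization, unfaithful content and information omission, and proposes NTS-CoT, whose three modules (Element-CoT, Date Selection, and Causal-CoT) mitigate them and outperform existing methods on three benchmarks.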

AI中文摘要

在线新闻的快速更新使得追踪事件发展具有挑战性，凸显了时间线摘要（TLS）的需求。幻觉（即大模型生成内容偏离源新闻）仍然是基于大模型的TLS中的关键问题，且现有研究对此关注不足。为弥补这一差距，我们识别出两类主要幻觉：新闻摘要中的不忠实内容和日期事件摘要中的信息遗漏。然后，我们提出NTS-CoT，一种利用思维链（CoT）推理来减轻TLS中幻觉的新框架。该框架包含三个关键模块：i) Element-CoT，用于捕获关键新闻元素以实现忠实摘要；ii) Date Selection，结合时间显著性和事件突出性进行时间戳选择；iii) Causal-CoT，用于推断因果关系并减少日期事件摘要中的遗漏。大量实验，包括在三个TLS基准上的定量分析和人工评估，表明NTS-CoT优于最先进的基线，有效减轻了幻觉并提升了基于大模型的TLS性能。我们的源代码可在该 https URL 获取。

英文摘要

The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing work. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT.

2606.13169 2026-06-12 cs.RO 新提交

Redesigning Regularization for Effective Policy Smoothing

重新设计正则化以实现有效的策略平滑

Taisuke Kobayashi, Naoto Yamanaka

发表机构 * National Institute of Informatics (NII)（国立信息学研究所）； The Graduate University for Advanced Studies (SOKENDAI)（综合研究大学院大学）

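AI summary: Addresses policy smoothing in reinforcement learning by identifying discrepancies between the theory and implementation of existing regularization and proposing remedies. The modified regularization yields smooth motion and better control performance across multiple tasks and algorithms, and sim-to-real experiments on a quadruped robot show that smoothness improves robustness to sudden changes in target velocity commands.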

Comments submitted to RA-L

AI中文摘要

本文提出了一种新颖的正则化设计，以有效平滑强化学习中的策略函数。虽然最初考虑了增强“全局”Lipschitz连续性的正则化，但由于平滑性与表达性之间的权衡，它被限制为“局部”Lipschitz连续性。然而，显而易见的是，原始实现繁琐且无法提供足够的平滑效果，导致人们倾向于更简单的实现。这源于理论与实现之间的差异，而更合适的实现有望促进平滑。因此，本文指出了原始实现无法正常工作的三个原因，并提供了相应的补救措施。这种改进的正则化在多个任务和算法中表现良好，成功实现了平滑运动，同时提高了控制性能。此外，通过将其应用于四足机器人的仿真到现实强化学习，证明了平滑运动能够提供对目标速度命令突变的鲁棒性。

英文摘要

This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances "global" Lipschitz continuity was initially considered, it has been limited to "local" Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can be expected to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provides remedies for them. The modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, applying it to sim-to-real reinforcement learning for a quadruped robot demonstrates that smooth motion provides robustness against sudden changes in target velocity commands.
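
For context on what a local Lipschitz regularizer penalizes, here is a generic sketch: it samples small perturbations around a state and averages the ratio of action change to state change. This is a textbook-style estimate written for illustration and is explicitly not the redesigned regularizer proposed in the paper; the function name, the uniform sampling, and the default radius are assumptions.

```python
import math
import random


def local_smoothness_penalty(policy, state, radius=0.05, samples=8, rng=None):
    """Average of ||pi(s + d) - pi(s)|| / ||d|| over random perturbations d near s."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    base = policy(state)
    total = 0.0
    for _ in range(samples):
        noise = [rng.uniform(-radius, radius) for _ in state]
        perturbed = [s + d for s, d in zip(state, noise)]
        state_change = math.hypot(*noise)
        action_change = math.dist(policy(perturbed), base)
        total += action_change / max(state_change, 1e-12)
    return total / samples


# Hypothetical one-dimensional linear policy with slope 2.
def linear_policy(s):
    return [2.0 * s[0]]


print(local_smoothness_penalty(linear_policy, [0.5]))
```

For the linear policy the estimate recovers the slope, approximately 2.0, and a constant policy gives 0. In training, a term like this would be added to the policy loss with a weight that trades smoothness against expressiveness, the tradeoff the abstract mentions.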

2606.13168 2026-06-12 cs.LG 新提交

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

路由何时变得可解释？对块注意力残差的因果探针

Aydin Javadov

发表机构 * ETH Zurich（苏黎世联邦理工学院）

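AI summary: Studies the interpretability of routing in Block Attention Residuals, finding that structured depth routing emerges only when routing is part of training and that routing weights dissociate from causal importance, so causal interventions are needed for verification.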

AI中文摘要

块注意力残差（Block AttnRes）通过将固定的加性残差替换为基于早期深度源表示的学习softmax，在前向传播中将跨层路由暴露为可检查的张量。这是一个诱人的可解释性目标：通常间接推断的信息流现在可以直接观察。我们询问这种暴露是否足以进行机制解释。我们在相同的路由消融干预下探测了两个同规模（0.6B）的Block AttnRes检查点：一个是通过确定性近因偏差调度（代码库将其视为路由等效加载路径）包装的普通Qwen3推理，另一个是从头训练且路由作为优化一部分的Block AttnRes Qwen3。包装基线的路由权重与内容无关，并重现了调度的分析预测。而训练的AttnRes检查点则表现出三种局部路由模式：通过早期层MLP的嵌入源路径、通过早期层注意力和MLP的当前状态路径，以及通过后期层注意力的较旧历史路径。除了这种分层之外，我们发现平均路由质量与因果重要性之间存在明显分离：在两个子层中，最大的质量切片并非最大的因果贡献，并且一个源家族在干预下携带了可观的质量但没有可检测的因果作用。因此，路由的架构暴露对于机制解释是必要但不充分的：只有当路由是训练的一部分时，结构化的深度路由才会出现，即使如此，描述性路由总结也应被视为待因果干预检验的候选假设，而非其本身的机制证据。

英文摘要

Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale (0.6B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
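
The gap between routing mass and causal importance can be seen in a toy model. The sketch below mixes earlier-depth source vectors with softmax weights and supports ablating a source by masking it and renormalizing. The dot-product scoring, the masking scheme, and the example vectors are assumptions made for illustration, not the architecture or intervention used in the paper.

```python
import math


def route(query, sources, masked=()):
    """Softmax-weighted mix of source vectors; masked sources receive zero weight."""
    kept = [i for i in range(len(sources)) if i not in masked]
    logits = {i: sum(q * x for q, x in zip(query, sources[i])) for i in kept}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {i: math.exp(value - peak) for i, value in logits.items()}
    norm = sum(exps.values())
    weights = [exps.get(i, 0.0) / norm for i in range(len(sources))]
    mixed = [
        sum(w * src[d] for w, src in zip(weights, sources))
        for d in range(len(query))
    ]
    return weights, mixed


# Two hypothetical sources that carry identical content.
sources = [[1.0, 0.0], [1.0, 0.0]]
query = [1.0, 0.0]

weights, output = route(query, sources)
_, ablated_output = route(query, sources, masked={0})
print(weights)         # prints [0.5, 0.5]
print(output)          # prints [1.0, 0.0]
print(ablated_output)  # prints [1.0, 0.0]
```

Source 0 holds half of the routing mass, yet removing it leaves the output unchanged because source 1 is redundant with it. This is the kind of case where a descriptive routing summary overstates causal importance.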

2606.13156 2026-06-12 cs.CV cs.AI 新提交

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

迭代视觉思维：通过视觉反馈教会视觉语言模型空间自我修正

Animesh Tripathy, Aswanth Krishnan

发表机构 * QpiAI India Pvt. Ltd（QpiAI印度私人有限公司）

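AI summary: Proposes Iterative Visual Thinking (IVT), a closed-loop visual-feedback framework with two-phase training (SFT + GRPO) that gives vision-language models spatial self-correction, improving accuracy by 2.4 to 3.2 percentage points on a mixed benchmark drawn from three datasets.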

AI中文摘要

视觉语言模型（VLM）在单次空间定位上表现强劲，但缺乏观察和修正自身预测的机制。我们发现，简单地提示VLM在其预测的渲染可视化上迭代会导致灾难性失败：指代表达理解的Acc@0.5从79.6%骤降至48.7%（下降31个百分点），揭示了定位能力与自我修正能力之间的根本差距。我们提出迭代视觉思维（IVT），一种闭环框架，其中模型预测边界框，观察预测在图像上的渲染结果，并通过视觉反馈迭代优化。两阶段训练方案弥合了自我修正差距：首先，我们利用基础模型自身的预测作为真实错误，并提示教师VLM生成修正推理轨迹，从而无需人工标注即可获得监督数据；其次，我们应用组相对策略优化（GRPO）和简单的IoU奖励来稳定多步优化。在涵盖RefCOCOg、Ref-Adv和Ref-L4的混合基准（505个测试样本）上，使用IVT的SFT预热在每个指标上都超过了单次基础模型：Acc@0.5升至82.0%（+2.4个百分点），Acc@0.7升至74.1%（+3.2个百分点），Acc@0.9升至48.3%（+2.8个百分点）。GRPO进一步将每步IoU退化减少了5倍，稳定了优化轨迹。所有训练仅使用单个GPU上的2400个样本，表明空间自我修正是一种可学习的能力，可以在适度规模下灌输。

英文摘要

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
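
Both the reward and the Acc@t metrics above rest on intersection over union (IoU) between a predicted and a gold bounding box. A minimal sketch follows; the corner-coordinate box format and the toy boxes are assumptions, and the paper's exact reward shaping may differ.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds, golds, threshold):
    """Fraction of predictions whose IoU with the gold box reaches the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical boxes: one exact match and one partial overlap (IoU = 1/7).
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
golds = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(acc_at(preds, golds, 0.5))  # prints 0.5
```

Used as a reward, `iou` gives a dense signal in [0, 1] at every refinement step, whereas `acc_at` is the thresholded evaluation metric.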

2606.13148 2026-06-12 cs.AI 新提交

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

TerraBench: 智能体能否对异构地球系统数据进行推理？

Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

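AI summary: Presents TerraBench, a benchmark built on the TerraAgent framework, which couples LLM planning with scientific tools for interactive reasoning over gridded data, satellite imagery, geospatial context, and simulators; it contains 403 tasks and 24,500 verified execution steps.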

AI中文摘要

气候和环境决策日益需要对异构输入进行推理，包括网格化物理数据、卫星图像、地理空间背景和模拟器输出。天气和气候基础模型可以很好地预测，但不能以语言进行交互式推理，而大型语言模型（LLM）可以用语言推理，但不能直接操作高维地球系统数据。因此，地球科学中的真实科学工作流仍然得不到充分支持。我们引入了TerraBench，一个基于地球科学推理的基准，构建于TerraAgent之上，这是一个ReAct风格的可执行框架，它交织推理、工具调用和观察，将LLM规划与环境检索、地理空间处理、模拟和基于工件的计算等科学工具相结合。TerraBench在单一可执行界面中统一了对地球观测图像、网格化数据、GIS推理和模拟的分析，而先前的基准将这些能力隔离为狭窄的独立任务。它也是该领域中第一个将过程级工具使用指标与容忍度感知数值评分配对的方法。该基准包含403个广泛的智能体任务，涵盖三个轨道（基础、模拟器基础和文档基础验证）和八个应用领域，共24,500个经过验证的执行步骤。这些结果表明，可靠的地球科学智能体必须超越工具访问，协调异构工作流，精确参数化工具，并保留工件来源。

英文摘要

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning, and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.
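
Tolerance-aware numeric scoring means an answer earns credit when it falls within an accepted margin of the reference value rather than only on an exact match. One plausible form is sketched below with the standard-library `math.isclose`; the function name and the 5% relative tolerance are hypothetical, since the abstract does not state the benchmark's actual thresholds.

```python
import math


def numeric_score(predicted, reference, rel_tol=0.05, abs_tol=1e-9):
    """Return 1.0 if the prediction is within tolerance of the reference, else 0.0."""
    # rel_tol is relative to the larger magnitude of the two values;
    # abs_tol covers references at or near zero.
    close = math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=abs_tol)
    return 1.0 if close else 0.0


print(numeric_score(103.0, 100.0))  # prints 1.0
print(numeric_score(110.0, 100.0))  # prints 0.0
```

An absolute tolerance matters for quantities such as anomalies, where a reference of exactly 0 would make any purely relative check fail.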

2606.13142 2026-06-12 cs.CL 新提交

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

HyPE：基于类别感知的超图编码与持久边嵌入用于人物角色对话

Sangwon Youn, Yoonjin Jang, Youngjoong Ko

发表机构 * Sungkyunkwan University（成均馆大学）

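AI summary: Proposes HyPE, which parses persona text into quadruples, builds a hypergraph, and encodes high-order relations with HyperGCN and Persistent Edge Embeddings (PEE), outperforming sentence-level pooling baselines on PersonaChat.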

Comments 11 pages, 2 figures, 4 tables

AI中文摘要

人物角色对话系统旨在生成与说话者角色一致的回复，但现有方法将角色视为一组扁平句子，未能建模角色属性间的高阶关系——例如，多个角色句子共享一个主题类别。我们提出HyPE（超图角色编码器）框架，该框架（i）将每个承载角色的文本分析为（核心、表达、情感、类别）四元组，以及（ii）将角色元素组织成一个超图，其超边由共享类别标签诱导。HyperGCN超图神经网络将此结构传播为角色摘要向量和软记忆库，以条件化回复生成器。我们进一步提出持久边嵌入（PEE），即轻量级的每类别可学习先验，融合到HyperGCN的消息传递步骤中。在贪婪解码下的PersonaChat上，HyPE在GPT-2、LLaMA-3.2-3B和Qwen2.5-3B骨干网络上一致优于句子级池化基线，表明结构化的超边级角色编码在不同模型规模上提供了可迁移的优势。

英文摘要

Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.
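
The hypergraph construction step can be illustrated with a small sketch: each persona element is a node, and every category label induces one hyperedge containing all nodes that share it. The quadruple values and function names below are hypothetical, and the sketch covers only the structure, not the HyperGCN propagation or the PEE priors.

```python
from collections import defaultdict


def build_hyperedges(elements):
    """Map each category to the node ids of the quadruples that carry it."""
    edges = defaultdict(list)
    for node_id, (_core, _expression, _sentiment, category) in enumerate(elements):
        edges[category].append(node_id)
    return dict(edges)


def incidence_matrix(elements):
    """Rows are nodes, columns are hyperedges (categories in sorted order)."""
    edges = build_hyperedges(elements)
    categories = sorted(edges)
    rows = [
        [1 if node in edges[cat] else 0 for cat in categories]
        for node in range(len(elements))
    ]
    return categories, rows


# Hypothetical (Core, Expression, Sentiment, Category) quadruples.
persona = [
    ("hiking", "I love hiking on weekends", "positive", "hobby"),
    ("guitar", "I play guitar", "positive", "hobby"),
    ("nurse", "I work night shifts as a nurse", "neutral", "work"),
]
print(build_hyperedges(persona))  # prints {'hobby': [0, 1], 'work': [2]}
print(incidence_matrix(persona))  # prints (['hobby', 'work'], [[1, 0], [1, 0], [0, 1]])
```

A hypergraph network typically consumes exactly this kind of node-by-hyperedge incidence matrix, so one hyperedge can connect any number of persona elements, unlike an ordinary pairwise edge.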

2606.13141 2026-06-12 cs.AI 新提交

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

重新思考长视频中的RAG：检索什么以及如何使用？

Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang, Juntae Lee, Kyuwoong Hwang, Fatih Porikli, Hwanjun Song

发表机构 * Department of Computer Science, Cranberry-Lemon University（蔓越莓柠檬大学计算机科学系）

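AI summary: To address single-granularity retrieval and benchmark flaws in video retrieval-augmented generation, proposes the V-RAGBench benchmark and the CARVE method, which uses chunk-adaptive reranking to interleave evidence from multiple configurations and outperforms eight recent VideoRAG baselines.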

AI中文摘要

检索增强生成正从文本扩展到长、自我中心的视频，系统必须跨多种模态和时间粒度选择与查询相关的块。然而，VideoRAG的进展受到两个差距的限制：现有基准允许无需视频即可回答查询，掩盖了检索错误；先前方法对每个查询应用单一模态-粒度配置，忽略了块级变异性。我们通过引入V-RAGBench（一个⟨查询，证据块，答案⟩三元组基准，支持检索和生成的忠实、解耦评估）和CARVE（一种简单方法，跨配置运行并行检索器并采用块自适应重排序以识别每个块的最佳配置）来解决这两个问题。每个块随后以其在检索期间选择的最佳配置进入生成器，产生一种交错证据形式，其中块级决策在检索和生成两个阶段传播。CARVE优于八种最近的VideoRAG基线，提供给生成器的块交错多种配置而非共享单一配置，这是查询级方法无法实现的行为。

英文摘要

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
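
The chunk-level selection idea can be sketched as follows: given a relevance score for every (chunk, configuration) pair, pick the winning configuration for each chunk, then pass the top-ranked chunks to the generator under their own winners. The configuration names and scores are hypothetical, and in practice the scores would come from a reranker; this is an illustration of the selection step only, not the CARVE implementation.

```python
def select_evidence(scores, top_k):
    """scores: {chunk_id: {config_name: relevance}}; returns [(chunk_id, config), ...]."""
    winners = []
    for chunk_id, per_config in scores.items():
        # Chunk-adaptive choice: each chunk keeps its own best configuration.
        config, score = max(per_config.items(), key=lambda item: item[1])
        winners.append((score, chunk_id, config))
    winners.sort(key=lambda w: w[0], reverse=True)
    return [(chunk_id, config) for _, chunk_id, config in winners[:top_k]]


# Hypothetical reranker scores for three chunks under two modality-granularity configs.
scores = {
    "chunk_1": {"frames@10s": 0.9, "transcript@30s": 0.4},
    "chunk_2": {"frames@10s": 0.2, "transcript@30s": 0.7},
    "chunk_3": {"frames@10s": 0.1, "transcript@30s": 0.3},
}
print(select_evidence(scores, top_k=2))
# prints [('chunk_1', 'frames@10s'), ('chunk_2', 'transcript@30s')]
```

The selected evidence mixes configurations across chunks, which a query-level method that commits to one configuration for the whole query cannot produce.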
