arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.26776 2026-05-27 cs.LG cs.AI

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

Affiliations: State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology; School of AI, Beijing Institute of Technology

AI summary: Proposes Residual Refined Experts with Instance-level Gating (R2E-IG), a Mixture-of-Experts architecture that uses a modular policy network and training with Dynamic Weight Adaption to improve the generalization of vehicle routing models under distribution shift.

Abstract

In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhances expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.
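The abstract does not give the R2E-IG equations, so the following is only a minimal sketch of the general idea of instance-level gating over experts with a residual connection. The function names, the linear gate, and the choice to add gate-weighted expert outputs to the input are all assumptions made for illustration, not the authors' architecture.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def instance_gate(instance_features, gate_weights):
    # Hypothetical linear gate: one score per expert, computed once per
    # instance (not per node or per token)
    scores = [sum(w * f for w, f in zip(row, instance_features))
              for row in gate_weights]
    return softmax(scores)

def residual_expert_mix(hidden, experts, gate_probs):
    # Assumed residual form: hidden + sum_i p_i * expert_i(hidden)
    out = list(hidden)
    for p, expert in zip(gate_probs, experts):
        delta = expert(hidden)
        out = [o + p * d for o, d in zip(out, delta)]
    return out

# Toy usage with two hand-written "experts"
experts = [lambda h: [2.0 * x for x in h], lambda h: [-x for x in h]]
probs = instance_gate([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]])
mixed = residual_expert_mix([1.0, 2.0], experts, probs)
```

Because the gate is computed from an instance-level feature vector, every decoding step for that instance uses the same expert mixture in this sketch.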

2605.26772 2026-05-27 cs.AI

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

Affiliations: University of Göttingen, Germany; Northeastern University, Boston, MA, USA

AI summary: Studies the mechanism of refusal in large reasoning models (LRMs) and finds that the chain of thought (CoT) and the activations jointly encode the refusal signal, so activation steering alone rarely reverses refusal, whereas a two-stage intervention (regenerating the CoT under activation steering) raises the reversal rate substantially.

Abstract

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and the CoT. This joint encoding makes LRMs more robust against activation-level interventions alone, but exposes the CoT as a possible alternative attack surface.
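The abstract does not say which form of activation steering is used. As a minimal sketch of one simple activation-level intervention, the snippet below removes the component of a hidden vector along a given direction. The function name and the toy vectors are hypothetical.

```python
import math

def remove_direction(hidden, direction):
    # Project out the component of `hidden` along `direction`:
    # h' = h - (h . r_hat) * r_hat
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]

# Toy usage: the result has zero component along the chosen direction
steered = remove_direction([3.0, 4.0], [0.0, 2.0])
```

An intervention of this kind only edits activations. The abstract's point is that in LRMs the generated CoT text can carry the signal as well, so editing activations alone may not be enough.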

2605.26770 2026-05-27 cs.CL

Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids

Fabian Lukassen, Jan Herrmann, Christoph Weisser, Alexander Silbersdorff, Benjamin Saefken, Thomas Kneib

Affiliations: University of Göttingen; BASF SE; Hochschule Bielefeld; TU Clausthal

AI summary: Through five controlled experiments, examines whether high-quality LLM-generated natural language explanations improve task accuracy in time-series energy forecasting; the explanations do not improve accuracy but inflate confidence, revealing a quality-usefulness gap.

Abstract

Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.

2605.26763 2026-05-27 cs.LG cs.AI

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

Affiliations: State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology; School of AI, Beijing Institute of Technology

AI summary: For the Maximal Covering Location-Interdiction Problem, proposes a dual-agent deep reinforcement learning framework based on adversarial learning that solves the problem efficiently and yields robust decisions.

Abstract

The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient infrastructure planning yet remains computationally intractable. Specifically, the upper level determines facility locations to maximize coverage, while the lower level executes worst-case interdiction to minimize the coverage. The strong coupling between the upper and lower levels, combined with their respective high combinatorial complexity, renders traditional methods ineffective. To bridge this gap, we propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework based on adversarial learning, comprising a location agent corresponding to the upper level and an interdiction agent corresponding to the lower level. Our contributions are threefold: (1) The location agent is trained simultaneously against an evolving interdiction agent, enabling it to capture the dynamic competitive interplay between the upper and lower levels; (2) To fully exploit the learned capabilities of the interdiction agent, we propose a Surrogate-based Ensemble Inference Strategy that utilizes the trained interdiction agent as a high-fidelity surrogate to guide the decisions of the location agent; (3) Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves superior computational efficiency while maintaining highly competitive solution quality compared to other baselines. Furthermore, our DADRL framework is model-agnostic to network structures, while its underlying adversarial learning paradigm demonstrates strong potential for solving other bi-level optimization problems.
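To make the bi-level structure concrete, here is a brute-force sketch on a tiny instance. It is not the DADRL method: it enumerates every choice at both levels, which is only feasible for toy sizes. The instance data, the parameter names `p` (facilities opened) and `r` (facilities interdicted), and the function names are assumptions for illustration.

```python
from itertools import combinations

def coverage(open_sites, covers):
    # Number of distinct demand points covered by the open facilities
    covered = set()
    for site in open_sites:
        covered |= covers[site]
    return len(covered)

def worst_case_coverage(open_sites, covers, r):
    # Lower level: the interdictor removes r open facilities so that
    # the remaining coverage is as small as possible
    return min(coverage(set(open_sites) - set(hit), covers)
               for hit in combinations(open_sites, r))

def solve_mclip(covers, p, r):
    # Upper level: choose p sites that maximize coverage after the
    # worst-case interdiction
    best = max(combinations(sorted(covers), p),
               key=lambda sites: worst_case_coverage(sites, covers, r))
    return best, worst_case_coverage(best, covers, r)

# Hypothetical instance: candidate site -> demand points it covers
covers = {"A": {1, 2, 3, 4}, "B": {5, 6}, "C": {5, 6, 7}, "D": {1, 2}}
best_sites, robust_value = solve_mclip(covers, p=2, r=1)
```

The nested enumeration shows why the problem is hard: every upper-level candidate requires solving a full lower-level problem, which is the coupling the abstract refers to.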

2605.26747 2026-05-27 cs.AI

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Heriberto Cuayahuitl, Grace Jang

Affiliations: UK’s NHS

AI summary: Presents MeDial-Speech, a dataset of real robot-patient and doctor-patient medical dialogue speech for training and evaluating medical AI, and evaluates three large language models on a sentence-selection benchmark.

Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2026)

Abstract

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

2605.26744 2026-05-27 cs.CV

Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy

Pascal Herrmann, Maarten Bieshaar, Dennis Mack, Robert Herzog, Juergen Gall

Affiliations: Bosch Research; University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary: Proposes a self-intersection loss based on a sphere proxy of the human body for training human motion generation models; it reduces self-intersections by up to 49% and improves evaluation metrics.

Comments: Accepted to BMVC 2025

Abstract

Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches generate motions frequently containing self-intersections, i.e., body parts interpenetrating, which are strong artifacts, severely limiting the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which allows us to calculate a self-intersection loss 98% faster and uses 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of the recent human motion generation methods human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at https://github.com/boschresearch/humansphereproxy .
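The abstract does not give the exact form of the loss. The sketch below shows one plausible sphere-proxy penalty: for every pair of spheres, add the penetration depth when the spheres overlap. The function name, the `skip_pairs` argument (for example, for adjacent body parts that always touch), and the toy spheres are assumptions.

```python
import math
from itertools import combinations

def self_intersection_loss(spheres, skip_pairs=frozenset()):
    # spheres: list of (center_xyz, radius) tuples approximating the body
    loss = 0.0
    for i, j in combinations(range(len(spheres)), 2):
        if (i, j) in skip_pairs:
            continue  # pairs that are allowed to overlap
        (ci, ri), (cj, rj) = spheres[i], spheres[j]
        depth = ri + rj - math.dist(ci, cj)  # > 0 means the spheres overlap
        loss += max(0.0, depth)
    return loss

# Toy usage: spheres 0 and 1 overlap by 0.5, sphere 2 is far away
spheres = [((0.0, 0.0, 0.0), 1.0), ((1.5, 0.0, 0.0), 1.0), ((10.0, 0.0, 0.0), 1.0)]
loss = self_intersection_loss(spheres)
```

A pairwise test between spheres needs only a distance and two radii, which is consistent with the speed and memory advantage over triangle-mesh methods that the abstract reports.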

2605.26738 2026-05-27 cs.CL

KARMA: Karma-Aligned Reward Model Adaptation

Jared Scott, Jesse Roberts

Affiliations: Tennessee Tech University

AI summary: Proposes the KARMA framework, which trains a reward model on Reddit conversation data to predict context-dependent response valuation and fine-tunes language models with reinforcement learning to improve pragmatic ability; finds that the best reward model does not necessarily yield the best alignment and that KARMA reduces factuality.

Abstract

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.

2605.26735 2026-05-27 cs.CL

Rethinking the Multilingual Reasoning Gap with Layer Swap

Maxence Lasbordes, Amélie Chatelain, Djamé Seddah

Affiliations: LightOn; Inria

AI summary: Builds multilingual reasoning datasets and fine-tunes specialist models, finding that the performance gap between native reasoning and English-pivoted reasoning is much smaller than previously reported; proposes Layer Swap, which transfers the English specialist's reasoning mid-layers into the native specialists to narrow the gap.

Abstract

Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (native reasoning) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (English-pivoted reasoning). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili); fine-tune specialists in both native and English-pivoted regimes on top of Qwen/Qwen3-8B-Base, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9-3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce a Layer Swap: transferring the English specialist's stronger reasoning mid-layers into each native specialist, closing most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
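The layer-swap idea can be sketched on plain Python lists standing in for per-layer weights. The function name and the choice of which indices count as "middle" are assumptions; the abstract does not state the layer range used.

```python
def layer_swap(native_layers, english_layers, start, end):
    # Keep the native specialist's outer layers and replace the layers in
    # [start, end) with the English specialist's layers
    if len(native_layers) != len(english_layers):
        raise ValueError("both specialists must have the same depth")
    merged = list(native_layers)
    merged[start:end] = english_layers[start:end]
    return merged

# Toy usage with string placeholders for six transformer layers
native = ["n0", "n1", "n2", "n3", "n4", "n5"]
english = ["e0", "e1", "e2", "e3", "e4", "e5"]
merged = layer_swap(native, english, start=2, end=4)
```

This only works when both specialists share a base model and depth, which matches the abstract's setup of fine-tuning both on the same base.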

2605.26734 2026-05-27 cs.CV

CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains

Tomohisa Takeda, Yu-Chieh Lin, Yuji Nozawa, Youyang Ng, Osamu Torii, Yusuke Matsui

Affiliations: Graduate School of Information Science and Technology, The University of Tokyo; Kioxia Corporation

AI summary: To address the lack of dialogue-history consistency and the domain restriction of existing MTCIR datasets, constructs the CIRCLED dataset by extending FashionIQ, CIRR, and CIRCO; multi-turn sessions are generated with a CIReVL-based retrieval pipeline and filtered in several stages for quality, yielding 22,608 multi-turn sessions across nine subsets with substantially greater scale and generality.

Abstract

Existing Multi-Turn Composed Image Retrieval (MTCIR) datasets lack dialogue-history consistency and are restricted to the fashion domain. To address these limitations, we construct CIRCLED by extending FashionIQ, CIRR, and CIRCO. In CIRCLED, the query at each turn progressively approaches the target image. Data are generated via a CIReVL-based retrieval pipeline and curated with multiple filters on retrieval success, turn length, consistency, and information redundancy to ensure quality. In total, we collect 22,608 multi-turn sessions across nine subsets, substantially exceeding Multi-turn FashionIQ (11,505 sessions) in both scale and generality. We further apply multiple baseline methods and quantitatively assess retrieval accuracy on CIRCLED. Our work provides a practical, high-quality benchmark to facilitate future research on multi-turn CIR. The dataset and code are publicly available at https://huggingface.co/datasets/tk1441/CIRCLED and https://github.com/mti-lab/circled.

2605.26733 2026-05-27 cs.LG cs.AI

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Nanjing, China

AI summary: Proposes the STARS training framework, which uses Jacobian spectral radius regularization to constrain latent states toward asymptotically stable fixed points, addressing the performance collapse of looped language models under deep recurrence, enabling reliable test-time scaling and improving peak performance.

Comments: ICML 2026

Abstract

Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness is a promising direction. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.
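The link between the Jacobian's spectral radius and a stable fixed point can be illustrated on a linear recurrence h ← J h + b, which converges from any start when the spectral radius of J is below 1. This is a toy illustration, not the STARS regularizer. The power-iteration estimate assumes a single dominant real eigenvalue, and the matrix and function names are hypothetical.

```python
def matvec(J, v):
    return [sum(a * x for a, x in zip(row, v)) for row in J]

def spectral_radius_estimate(J, iters=200):
    # Power iteration with max-norm scaling; assumes one dominant real
    # eigenvalue, so it is not a general-purpose estimator
    v = [1.0] * len(J)
    est = 0.0
    for _ in range(iters):
        w = matvec(J, v)
        est = max(abs(x) for x in w)
        if est == 0.0:
            return 0.0
        v = [x / est for x in w]
    return est

def iterate(J, b, h, steps):
    # Repeatedly apply the looped update h <- J h + b
    for _ in range(steps):
        h = [x + y for x, y in zip(matvec(J, h), b)]
    return h

J = [[0.5, 0.1], [0.0, 0.3]]
rho = spectral_radius_estimate(J)            # close to 0.5, below 1
h_200 = iterate(J, [1.0, 1.0], [0.0, 0.0], 200)
h_400 = iterate(J, [1.0, 1.0], [0.0, 0.0], 400)
```

With a spectral radius below 1, running more loop iterations no longer changes the state, which is the behavior one would want from reliable test-time scaling.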

2605.26732 2026-05-27 cs.LG

APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction

Yifan Sun, Lei Cheng, Sijie Chen, Ting Zhang, Jianlong Li, Shikai Fang

Affiliations: College of Information Science and Electronic Engineering

AI summary: Proposes the APEX framework, which uses the amplitude predicted by a lower-frequency neural operator as an anchor and combines a Green's-function-inspired phase prior with a conditional flow-matching enhancer to predict higher-frequency wave fields when target data are scarce; it outperforms direct extrapolation and joint generative methods on several benchmarks.

Abstract

Learning-based surrogates have become increasingly effective for wave-field prediction, and neural operators in particular have shown strong performance within observed frequency regimes. However, higher-frequency prediction under scarce target supervision remains comparatively underexplored, especially in wave problems where higher-frequency data are substantially more expensive to simulate or measure than lower-frequency data. A central difficulty is that cross-frequency transfer is inherently asymmetric: coarse amplitude structure remains relatively stable across frequencies, whereas phase-sensitive oscillatory structure deteriorates much more rapidly as frequency increases. Motivated by this asymmetry, we propose APEX, Amplitude-anchored and Phase-prior-guided Enhancement from eXtrapolated coarse predictions, a framework for target-scarce higher-frequency wave-field prediction. A lower-frequency neural operator first provides a coarse prediction in the target-frequency regime, from which we retain only the amplitude as a transferable structural anchor. A conditional flow-matching enhancer then reconstructs the target higher-frequency field under the guidance of a Green's-function-inspired phase prior. Experiments on SimpleWave, Helmholtz, and Maxwell benchmarks show that APEX consistently outperforms direct lower-to-higher extrapolation, target-adapted operator, and joint generative baselines under limited target-frequency supervision. Our results suggest that reliable higher-frequency prediction of oscillatory wave fields should not rely on direct end-to-end transfer of the full complex field, but instead on explicitly reusing transferable coarse structure while separately recovering the missing oscillatory detail.
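The amplitude/phase split can be sketched with complex numbers. In APEX the final field comes from a conditional flow-matching enhancer; the snippet below replaces that learned step with a direct recombination, purely to show what "keep the amplitude, supply the phase separately" means. The phase prior `k * r` is the phase of a free-space point-source Green's function in a homogeneous medium, used here as an assumed stand-in, and all names are hypothetical.

```python
import cmath
import math

def green_phase_prior(distances, wavenumber):
    # Phase k * r of an outgoing wave from a point source (assumed prior)
    return [wavenumber * r for r in distances]

def recombine(coarse_field, phase_prior):
    # Keep only |u| from the coarse prediction and attach the prior phase
    return [cmath.rect(abs(u), phi) for u, phi in zip(coarse_field, phase_prior)]

# Toy usage: two sample points at distances 0 and 1 from the source
coarse = [2 + 0j, 0 + 3j]
prior = green_phase_prior([0.0, 1.0], wavenumber=math.pi / 2)
field = recombine(coarse, prior)
```

The coarse prediction's own phase is discarded entirely in this sketch, reflecting the abstract's claim that amplitude transfers across frequencies better than phase.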

2605.26731 2026-05-27 cs.AI cs.CL

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Yong-eun Cho

Affiliations: KailosLab

AI summary: Across 432 runs, finds that the harness sensitivity of LLM agents varies non-monotonically with model tier and depends on model type (chat vs. reasoning), contradicting the assumption that higher-capability models need less structural guidance.

Comments: 9 pages, 3 figures

Abstract

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.
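The run count is consistent with a full crossing of the stated factors with one run per cell, although the abstract does not say so explicitly. The check below also shows that 91.7% would correspond to 22 of 24 tasks if the rate is computed per harness condition; that reading is an assumption.

```python
models, harnesses, tasks = 6, 3, 24

# Full factorial design with one run per (model, harness, task) cell
total_runs = models * harnesses * tasks

# 22 successes out of 24 tasks, as a percentage rounded to one decimal
rate_22_of_24 = round(100 * 22 / 24, 1)
```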

2605.26729 2026-05-27 cs.CV

Learning Reference-Guided Exposure Correction with Hybrid Illumination Characteristics

Hao Ren, Zetong Bi, Zhaoliang Wan, Hui Cheng

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

AI summary: Proposes HICNet, a reference-guided exposure correction framework that extracts illumination embeddings with a lightweight encoder and combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained exposure matching; without ground truth or intrinsic decomposition, it achieves better accuracy on benchmarks and generalizes to unseen scenes.

Comments: ICASSP 2026

Abstract

We present HICNet, a reference-guided exposure correction framework. A lightweight, content-agnostic encoder distills each image into a compact illumination embedding capturing regional brightness, edge contrast, and higher-order luminance moments. The embedding difference between a source and its reference drives a multi-scale modulation network that combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained, illumination-aware spectral gating, producing exposure-matched outputs while faithfully preserving scene details. A cross-batch contrastive loss orders the illumination manifold, bolstering robustness to diverse lighting conditions. Trained without ground truth or intrinsic decomposition, HICNet attains better accuracy on public benchmarks and generalizes well to entirely unseen scenes.
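FiLM (feature-wise linear modulation) scales and shifts each feature channel with parameters predicted from a conditioning signal. The sketch below shows only that modulation step. How HICNet derives the scale and shift from the embedding difference is not described in the abstract, so `gamma` and `beta` are passed in directly and the function name is hypothetical.

```python
def film(feature_maps, gamma, beta):
    # feature_maps: one list of values per channel
    # Each channel c becomes gamma[c] * x + beta[c]
    return [[g * x + b for x in channel]
            for channel, g, b in zip(feature_maps, gamma, beta)]

# Toy usage: brighten channel 0 by scaling, offset channel 1
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```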

2605.26725 2026-05-27 cs.CV

Joint 2D-3D Segmentation and Association in Street-level Imaging

Amir Melnikov, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi

Affiliations: Institute of Science Tokyo

AI summary: Proposes a unified framework that combines zero-shot detection and segmentation with structure-from-motion, replacing conventional 2D multi-object tracking with a 3D-driven geometric-consistency mechanism to achieve stable cross-view segmentation and identity association in street-level imagery, with a 22% performance gain in challenging urban scenes.

Comments: 15 pages, 6 image figures, 1 in-body table, 1 in-body algorithm, 2 indexes with tables

Abstract

Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.

2605.26720 2026-05-27 cs.AI

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology, Beijing, China

AI summary: Proposes the CUDAnalyst framework, which uses trajectory freezing and selective feedback injection to attribute planning decisions to feedback components, showing that explicit planning helps only when feedback is aligned and that effective planning emerges from structured multi-feedback interactions.

Comments: ICML 2026 accepted, camera-ready in progress

Abstract

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce CUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. CUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.
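The abstract mentions "coalitional-style attribution" without naming a specific scheme. Shapley values are one standard coalitional attribution, so the sketch below uses them as an assumed stand-in. The feedback component names and the toy payoff table are hypothetical and are not taken from the paper.

```python
from itertools import permutations

def shapley_values(players, value):
    # Average each player's marginal contribution over all join orders
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical payoff: speedup obtained when the planner sees each subset
# of feedback signals; the joint value exceeds the sum of the parts
payoff = {
    frozenset(): 0.0,
    frozenset({"compile"}): 1.0,
    frozenset({"profile"}): 0.0,
    frozenset({"compile", "profile"}): 4.0,
}
attribution = shapley_values(["compile", "profile"], payoff.__getitem__)
```

In this toy table the "profile" signal is worthless alone but valuable in combination, and the attribution still credits it. That is the kind of multi-feedback interaction effect a coalitional scheme can surface and a one-at-a-time ablation would miss.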

2605.26718 2026-05-27 cs.LG

MTL-FNO: A Lightweight Multi-Task Fourier Neural Operator for Sparse Field Reconstruction

MTL-FNO:一种用于稀疏场重建的轻量级多任务傅里叶神经算子

Siyu Ye, Shihang Li, Zhiqiang Gong, Benrong Zhang, Weien Zhou, Yiyong Huang, Wen Yao

发表机构 * Defense Innovation Institute, Academy of Military Science, Beijing, 100071, China(军事科学院国防科技创新研究院,北京,100071,中国) Intelligent Game and Decision Laboratory, Beijing, 100071, China(智能博弈与决策实验室,北京,100071,中国)

AI总结 针对航空航天飞行器多场稀疏重建中模型庞大且难以利用跨场相关性的问题,提出基于硬参数共享的轻量级多任务傅里叶神经算子MTL-FNO,通过极坐标解耦优化和Cayley变换实现高效联合训练,在少样本条件下模型大小减少76%和60%且精度相当或更优。

详情
AI中文摘要

高效的星载多场稀疏重建对于航空航天飞行器的自主运行至关重要。虽然现有的深度学习模型在单场重建中表现出潜力,但部署多个独立模型会导致模型尺寸急剧增长,并且无法利用跨场相关性,尤其是在少样本条件下。为了解决这些挑战,我们首先提出了一种轻量级多任务傅里叶神经算子(MTL-FNO),这是一种基于硬参数共享的端到端联合训练框架。在每一层中,参数被分为共享部分和任务特定部分,以捕获各场之间的共同特征,同时保留任务特定特征。此外,任务特定的微调参数被实现为低秩项,实现了显著的模型压缩。其次,为了解决共享参数和任务特定参数及其实部和虚部联合优化的困难,我们从极坐标形式的角度重新审视了FNO的谱权重,并设计了一种具有物理意义的解耦优化方案。具体地,我们应用极分解将谱权重逐片解耦为编码相位信息的酉张量和表征振幅的半正定张量。通过解耦相位和振幅的优化,我们的方法可以有效缓解任务冲突。同时,为了在训练过程中保持酉几何保真度,引入Cayley变换对酉张量进行重参数化,将约束优化问题转化为无约束优化问题。最后,在两个代表性工程案例上验证了所提方法在少样本条件下的有效性。结果表明,MTL-FNO达到了与标准FNO相当甚至更优的精度,同时分别将总模型大小减少了76%和60%。

英文摘要

Efficient onboard multi-field sparse reconstruction is essential for the autonomous operation of aerospace vehicles. While existing deep learning models exhibit promise for single-field reconstruction, deploying multiple independent models leads to prohibitive model size growth and fails to exploit cross-field correlations, particularly under few-shot conditions. To address these challenges, we first propose a lightweight multi-task Fourier neural operator (MTL-FNO), an end-to-end joint training framework based on hard parameter sharing. In each layer, the parameters are divided into shared and task-specific components to capture common features across fields while preserving task-specific characteristics. Moreover, the task-specific fine-tuning parameters are implemented as low-rank terms, achieving substantial model compression. Second, to address the difficulty of co-optimizing shared and task-specific parameters along with their real and imaginary parts, we revisit the FNO's spectral weight from a polar-form perspective and devise a physically meaningful decoupled optimization scheme. Specifically, we apply polar decomposition to slice-wise disentangle the spectral weight into a unitary tensor encoding phase information and a positive semi-definite tensor characterizing amplitude. By decoupling the optimization of phase and amplitude, our method can effectively mitigate task conflicts. Meanwhile, to preserve unitary geometric fidelity during training, the Cayley transform is introduced to reparameterize the unitary tensor, converting the constrained optimization problem to an unconstrained one. Finally, the effectiveness of the proposed method under few-shot conditions is validated on two representative engineering cases. Results show that MTL-FNO achieves accuracy comparable to or even surpassing that of standard FNO, while reducing total model size by 76% and 60%, respectively.
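
To make the Cayley reparameterization step concrete, here is a minimal sketch for a single 2x2 slice. It assumes the common form Q = (I - A)^{-1}(I + A) with A skew-Hermitian; the paper's exact convention, slice size, and parameter layout are not stated in the abstract, so the four real parameters `a, b, c, d` and all helper names are hypothetical. The point is only that any choice of unconstrained real parameters yields a unitary matrix.

```python
def mat_mul(x, y):
    """Product of two 2x2 complex matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(m):
    """Inverse of a 2x2 complex matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def conj_transpose(m):
    """Conjugate transpose of a 2x2 complex matrix."""
    return [[m[j][i].conjugate() for j in range(2)] for i in range(2)]


def skew_hermitian(a, b, c, d):
    """Build A with A^H = -A from four unconstrained real parameters."""
    return [[complex(0, a), complex(b, c)],
            [complex(-b, c), complex(0, d)]]


def cayley(A):
    """Cayley transform Q = (I - A)^{-1} (I + A); unitary when A^H = -A."""
    eye = [[1 + 0j, 0j], [0j, 1 + 0j]]
    i_minus_a = [[eye[i][j] - A[i][j] for j in range(2)] for i in range(2)]
    i_plus_a = [[eye[i][j] + A[i][j] for j in range(2)] for i in range(2)]
    return mat_mul(mat_inv(i_minus_a), i_plus_a)


# Arbitrary (hypothetical) parameter values; a gradient step could move
# them freely without leaving the set of unitary matrices.
Q = cayley(skew_hermitian(0.3, -1.2, 0.7, 2.0))
gram = mat_mul(conj_transpose(Q), Q)  # should be close to the identity
print(gram)
```

Because I - A is always invertible for skew-Hermitian A (its eigenvalues are purely imaginary), the map is defined for every parameter value, which is what turns the constrained problem into an unconstrained one.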

2605.26712 2026-05-27 cs.CV

METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

METATR:一个多语言、不断演进的自动文本识别基准

Mélodie Boillet, Solène Tarride, Christopher Kermorvant

发表机构 * TEKLIA

AI总结 提出METATR基准,通过多样化多语言文档、标准化评估框架和动态更新机制,全面评估自动文本识别系统(尤其是视觉大语言模型)的性能,支持模型比较与选择。

详情
AI中文摘要

反映真实文档多样性和复杂性的基准对于准确评估自动文本识别(ATR)系统,特别是视觉大语言模型(vLLMs)至关重要。尽管最近的模型表现出令人印象深刻的性能,但它们通常在包含现代印刷文本(主要是英语)的数据集上进行评估,这限制了它们与许多实际应用的相关性。因此,为特定用例选择模型需要在与目标文档匹配的数据上进行评估。这突显了代表性基准对于实际应用的重要性。在本文中,我们介绍了METATR(v1.0),一个多语言、不断演进的基准,旨在评估ATR模型在广泛文档上的性能,促进有意义的模型比较和选择。该基准通过包含来自各种公共收藏的文档来最大化多样性。这些文档涵盖29种语言,并包含多种文字和布局的文本。除了数据集本身,METATR还定义了标准化的提示和归一化方法,并建立了一个动态评估框架。这种方法旨在产生可重复的结果,同时随着时间的推移保持可扩展性。我们评估了广泛的最先进系统,包括开源模型和闭源模型。结果从多个维度报告,包括数据集和语言级别的性能、对手写文档的鲁棒性以及计算效率。我们的发现表明,尽管专有模型实现了最一致的性能,但在不同文字和布局之间仍然存在显著差异。总体而言,METATR提供了一个多维度的、面向从业者的框架,用于在真实条件下评估多语言ATR,并随着领域的发展跟踪进展。

英文摘要

Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.

2605.26710 2026-05-27 cs.RO

Look Further: Socially-Compliant Navigation System in Residential Buildings

看得更远:住宅楼中的社交合规导航系统

Akira Shiba, Marina Obata, Nathan Kau, Zoltan Beck, Rishi Shah, Michael Sudano, Sabrina Lee

发表机构 * Toyota Woven City(丰田织城)

AI总结 提出一种主动变道(PLC)运动模式,通过将反应距离扩展到8米以上,改善人类对机器人运动的感知,并在直走廊场景中显著提升安全性、流畅性和礼貌性。

Comments 2025 ACM/IEEE International Conference on Human-Robot Interaction

Journal ref 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Melbourne, Australia, 2025, pp. 272-282

详情
AI中文摘要

移动机器人对人的反应距离强烈影响人机交互的多种品质。本文聚焦于移动配送机器人在住宅室内走廊环境中的导航。社交导航方法通常侧重于避免令人不适的人机交互,例如机器人侵入某人的个人空间。由于个人空间已被证明仅在几米范围内,社交导航方法通常侧重于解决这些短距离交互。然而,在本工作中,我们证明通过将反应距离扩展到超过8米(远超出典型交互距离),可以改善人类对机器人运动的感知。我们引入了主动变道(PLC)运动模式以及利用该模式在更远距离上对人做出反应的导航系统。该模式包括当机器人在走廊中从中心向侧面导航时,在距离迎面而来的人8米处改变其横向位置。我们进行了一项有42名参与者的用户研究,基于三个服务目标(安全性、流畅性和礼貌性)评估他们对配送机器人的印象。在直走廊场景(正面接近)中,结果显示与文献中典型的运动模式(减速、停止和在靠近人时反应性避障)相比,这三个目标均有显著改善。相比之下,在交叉口(盲角)场景中,没有任何一种方法显著优于其他方法,参与者对机器人运动模式的偏好各不相同。

英文摘要

The distance at which a mobile robot reacts to a person strongly impacts various qualities of the human-robot interaction. In this paper, we focus on the navigation of a mobile delivery robot platform in a residential indoor hallway environment. Social navigation methods typically focus on avoiding uncomfortable human-robot interactions, such as when a robot encroaches on someone's personal space. Since personal space has been shown to be in the range of just a few meters, social navigation methods typically focus on deconflicting and resolving these short-range interactions. In this work, however, we demonstrate that by extending the reaction distance to over eight meters, far beyond the typical interaction distance, we can improve the human's perception of the robot's motion. We introduce the Proactive Lane-Changing (PLC) motion pattern and a navigation system that leverages it to react to people at an increased distance. This pattern consists of changing the robot's lateral position as it navigates down the hallway from the center to the side at an eight-meter distance from an oncoming person. We conducted a user study with 42 participants to assess their impressions of the delivery robot based on three service objectives: safety, smoothness, and politeness. In the straight hallway scenario (Frontal Approach), results showed significant improvement in each of these three objectives compared to typical motion patterns found in the literature: slowing down, stopping, and reactive collision avoidance in the proximity of a person. In contrast, in the intersection (Blind Corner) scenarios, none of the approaches performed significantly better than any other, with participants having a diverse range of preferences among robot motion patterns.

2605.25981 2026-05-27 cs.CL

When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

LLM 代理何时对表面噪声与语义噪声做出不同处理?一项基于 68 个单元格的测量研究及留出轨迹级验证

Liyun Zhang, Jiayi Guo

发表机构 * School of Information and Software Engineering, UESTC(信息与软件工程学院,电子科技大学) Jacobs School of Engineering, UC San Diego(工程学院,加州大学圣地亚哥分校)

AI总结 本研究通过 68 个单元格的测量实验,发现大语言模型驱动的思维链和 ReAct 代理对语义扰动(如释义、同义词)比表现扰动(如格式、重排序)更敏感,并基于留出模型和轨迹级机制分析提出了“隐蔽发散”解释。

详情
AI中文摘要

我们记录了一个经验现象:在来自七个架构家族的十种大语言模型驱动的思维链和 ReAct 代理中,意义承载扰动(例如,释义、同义词)比同等严重程度的表现扰动(例如,格式、重排序)更频繁地改变最终答案。跨越 GSM8K、MATH 和 HotpotQA 的 68 个单元格(1,530 个原始样本和约 11,150 个变体),在严重性匹配后,不一致性差距平均为 +19.69 个百分点(配对 t=9.58,p<0.0001),其中 64/68 个单元格为正。该差距通过了四次严重性代理审计,并且在排除 qwen 模型时仍然显著(+11.10 个百分点,p<0.0001)。几项压力测试诚实地失败了:在更严格的假设下,聚类自助法显著性消失;可处理性对比无法复制;跨架构生成器交换破坏了每个单元格的排名;第二个 LLM 判断器仅产生中等一致性(κ=0.50)。 然后,我们在一个完全留出的第 11 个模型(qwen2.5-14B-Instruct;1,800 条轨迹)上验证了标题效应,并重新测试了一个预先注册的能力×可处理性分区,观察到一个小但正的留出效应(3/4 个单元格为正;合并 Welch t=3.81,p=9.6×10^{-4})。利用留出轨迹,我们探测了四个轨迹级机制信号。两个先前的机制主张未能复制并被明确撤回。两个新的探测反而支持一种“隐蔽发散”图景:语义扰动通常保留第一个动作,但从后续步骤开始导致中间推理发散,并伴随略微更深的轨迹。我们将此定位为一项带有留出复现和部分轨迹级解释的测量贡献,说明语义扰动如何通过代理推理传播。代码、扰动语料库、原始轨迹和分析脚本已匿名发布以供评审。

英文摘要

We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($κ=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.

2605.25930 2026-05-27 cs.SD

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

CosyEdit2: 面向语音编辑的强化学习解锁更好的零样本TTS

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin

发表机构 * College of Computer Science, Nankai University(南开大学计算机科学学院) College of Artificial Intelligence, Nankai University(南开大学人工智能学院)

AI总结 提出CosyEdit2,通过两阶段后训练框架(监督编辑初始化+基于目标语音无关数据的编辑导向GRPO)解决语音编辑与零样本TTS的局部声学一致性问题,显著提升编辑性能并增强零样本TTS能力。

详情
AI中文摘要

语音编辑和零样本文本到语音(TTS)共享基于语音提示的类似生成基础,但语音编辑对与周围未编辑内容的局部声学一致性要求严格得多。虽然先前工作表明监督微调(SFT)能使TTS模型获得功能性编辑能力,但该方法根本上受限于不完美的配对编辑数据和粗粒度的优化信号。为解决这些限制,我们提出CosyEdit2,一种构建于两阶段后训练框架上的语音编辑模型,该框架从监督编辑初始化逐步过渡到基于目标语音无关数据的编辑导向组相对策略优化(GRPO)。大量实验表明,CosyEdit2不仅显著提升了语音编辑性能,还解锁了更好的零样本TTS能力,揭示了两项任务之间更深层的相互关系。音频样本可在 https://cjy1018.github.io/CosyEdit2 获取。

英文摘要

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.

2605.25731 2026-05-27 cs.CL

Trait-Aware Policy Optimization for Autoregressive Multi-Trait Essay Scoring

面向自回归多维度作文评分的特质感知策略优化

Zhengyang Wang, Sanwoo Lee, Jiaxin Wang, Chenxi Miao, Weikang Li, Yunfang Wu

发表机构 * Peking University(北京大学) Baidu Inc.(百度公司)

AI总结 提出特质感知策略优化(TAPO)框架,通过分解样本和特质维度的奖励并结合增强提示,提升自回归多维度作文评分性能。

详情
AI中文摘要

多维度作文评分旨在跨多个维度提供写作质量的细粒度评估。然而,如何有效后训练自回归评分模型仍未充分探索。在本文中,我们提出了特质感知策略优化(TAPO),一种专为自回归多维度评分设计的后训练框架。我们的方法沿样本和特质维度分解奖励,结合全局评分一致性、特质级准确性、格式有效性以及跨特质依赖保持。此外,我们在整个训练过程中使用增强提示,通过融入原始提示文本和特质描述,为特质特定分数生成提供更丰富的语义信息。跨多个骨干模型的实验表明,我们的方法在监督微调和标量奖励优化基线上持续提升了多维度评分性能,证明了特质感知后训练在作文评分中的有效性和可迁移性。

英文摘要

Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tailored to autoregressive multi-trait scoring. Our method decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. In addition, we use enhanced prompts throughout training by incorporating original prompt texts and trait descriptions, providing richer semantic information for trait-specific score generation. Experiments across multiple backbone models show that our method consistently improves multi-trait scoring performance over supervised fine-tuning and scalar-reward optimization baselines, demonstrating the effectiveness and transferability of trait-aware post-training for essay scoring.

2605.25510 2026-05-27 cs.CL

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

好奇心时代遇上AI时代:大型语言模型中的儿童安全基准测试

Samee Arif, Angana Borah, Rada Mihalcea

发表机构 * University of Michigan(密歇根大学)

AI总结 针对7-11岁儿童使用LLM的安全性问题,提出基于发展心理学的KIDBench基准,通过隐式线索和显式年龄指令提升安全性,并开发了儿童安全评估器KIDGuardLlama和响应模型KIDLlama。

详情
AI中文摘要

儿童越来越多地接触大型语言模型(LLM),这可能会使他们接触到发展不适当或需要年龄敏感性安全、指导和界限的回应。现有的LLM安全评估主要关注有害内容规避,并未明确针对面向儿童的安全性。我们引入了KIDBench,这是一个使用基于发展心理学的LLM作为评判标准的基准,用于评估面向7-11岁儿童的LLM安全性。KIDBench包含十个类别的真实儿童查询,包括单轮提示和多轮儿童角色模拟。我们比较了无儿童上下文的无提示、暗示儿童说话者的隐式提示以及显式年龄指令。隐式提示使模型得分提高了9-47%,而显式年龄进一步增加了10-30%的增益。跨语言和文化评估显示,不同语言和国家背景下的安全行为不均匀。多轮模拟显示,面向儿童的响应质量从第一轮到最差轮次可能下降6-24%。除了评估,我们还引入了儿童安全评估器KIDGuardLlama和面向儿童的响应模型KIDLlama,展示了KIDBench如何支持更安全的面向儿童AI。

英文摘要

Children increasingly have access to Large Language Models (LLMs), which may expose them to responses that are developmentally inappropriate or require age-sensitive safety, guidance, and boundaries. Existing LLM safety evaluations largely focus on harmful-content avoidance and do not explicitly target child-facing safety. We introduce KIDBench, a benchmark for evaluating child-facing LLM safety for ages 7-11 using a developmental-psychology-grounded LLM-as-a-Judge rubric. KIDBench contains realistic child queries across ten categories, with single-turn prompts and multi-turn child-actor simulations. We compare no-cues prompts with no child context, implicit-cues prompts that suggest a child speaker, and explicit age instructions. Implicit cues improve scores by 9-47% across models, while explicit age adds a further 10-30% gain. Cross-lingual and cultural evaluations show uneven safety behavior across languages and country contexts. Multi-turn simulations show that child-facing response quality can degrade by 6-24% from the first to the worst turn. Beyond evaluation, we introduce KIDGuardLlama, a child-safety evaluator, and KIDLlama, a child-oriented response model, showing how KIDBench supports safer child-facing AI.

2605.25507 2026-05-27 cs.AI

Credit Assignment with Resets in Language Model Reasoning

语言模型推理中带有重置的信用分配

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Youliang Yu, Daniel Jiang, Kavosh Asadi, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

发表机构 * Meta AI Columbia University(哥伦比亚大学) Meta Superintelligence Labs(Meta超智能实验室) Tel Aviv University(特拉维夫大学)

AI总结 提出随机重置策略优化(RRPO)和自重置策略优化(SRPO)两种方法,通过重置到中间状态并重新采样反事实延续来改进语言模型多步推理中的信用分配,SRPO在多个推理基准上优于标准GRPO和RRPO。

详情
AI中文摘要

当代使用可验证奖励方法的强化学习通过对轨迹中的所有令牌统一分配单一结果奖励来对多步推理进行语言模型后训练。这种统一分配忽略了哪些步骤促成了成功或失败。改进信用分配可以通过实现对错误推理步骤的针对性细化来解决这一限制,而不是统一更新整个轨迹。重置是一种简单的机制,通过返回到中间状态并重新采样反事实延续来实现更精确的信用分配,从而将结果差异归因于该点做出的决策。我们提出了两种这样的方法:随机重置策略优化(RRPO),其中重置状态从推理步骤中随机抽取;以及自重置策略优化(SRPO),其中模型自我定位错误轨迹中的错误步骤并在此重置。我们在保守策略迭代(CPI)框架内分析了这些方法。通过针对可改进状态的信用分配预言机扩展CPI,相比于随机重置可证明改进。在多个模型和推理基准上,SRPO通过仅在自我定位的重置处采样多个后缀延续并从其奖励中学习,仅使用模型本身且无需外部监督,始终优于标准GRPO和RRPO。

英文摘要

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
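
The reset idea can be pictured with a small sketch. It assumes GRPO-style group-relative normalization over the rewards of several continuations sampled from one reset point, and it assumes that tokens in the shared prefix receive zero advantage; both are illustrative assumptions, and the function names, prefix length, and rewards below are hypothetical rather than taken from the paper.

```python
from statistics import mean, pstdev


def suffix_advantages(rewards, eps=1e-8):
    """Group-relative advantages for continuations sampled at one reset."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(prefix_len, suffix_lens, rewards):
    """Per-token advantages: zero on the shared prefix, one value per suffix.

    Since every continuation shares the same prefix, outcome differences
    are attributed only to tokens generated after the reset point.
    """
    advs = suffix_advantages(rewards)
    return [[0.0] * prefix_len + [a] * n for a, n in zip(advs, suffix_lens)]


# Hypothetical example: reset after 3 tokens, four resampled suffixes,
# two of which reach a verified-correct answer (reward 1.0).
rewards = [1.0, 0.0, 0.0, 1.0]
per_token = token_advantages(prefix_len=3, suffix_lens=[2, 4, 1, 3],
                             rewards=rewards)
for row in per_token:
    print([round(a, 2) for a in row])
```

Compared with spreading one outcome reward over the whole trajectory, this concentrates the learning signal on the decisions made after the reset, which is the intuition the abstract gives for more precise credit assignment.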

2605.25480 2026-05-27 cs.CL

Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki

检索即推理:通过LLM-Wiki实现自我进化的智能体原生检索

Haoliang Ming, Feifei Li, Xiaoqing Wu, Wenhui Que

发表机构 * Tencent Inc.(腾讯公司)

AI总结 提出LLM-Wiki系统,将外部知识组织为可编译、可组合、自进化的Wiki页面,通过工具调用接口实现搜索、阅读和链接跟踪操作,并引入错误簿进行持续自纠正,在多项多跳问答基准上取得最优结果。

Comments 15 pages, 3 figures, 10 tables, 1 algorithm

详情
AI中文摘要

LLM智能体需要的检索行为应更像推理(搜索、阅读、遍历、判断证据是否充分),而非一次性上下文获取。然而,当前的检索增强生成(RAG)系统将外部知识组织为扁平块,通过嵌入相似性检索,暴露出一种不适合迭代推理智能体的“检索即查找”接口。我们提出LLM-Wiki,一种智能体原生检索系统,它将外部知识视为可编译、可组合、自进化的结构而非静态检索索引,从而实现了“检索即推理”范式。LLM-Wiki将文档编译为带有双向链接的结构化Wiki页面,通过标准工具调用接口暴露搜索、阅读和链接跟踪操作,并引入错误簿进行持久的结构和语义自纠正。LLM-Wiki在HotpotQA、MuSiQue和2WikiMultiHopQA上取得了最先进的结果,比HippoRAG 2、LightRAG和GraphRAG高出2.0-8.1个F1点。在AuthTrace上,LLM-Wiki取得了最佳总体准确率,在多文档结构化查询上尤其有显著提升,证实了基于编译的检索在链式多跳推理之外也具有泛化能力。

英文摘要

LLM agents require retrieval to behave less like one-shot context fetching and more like reasoning: searching, reading, traversing, and deciding when evidence is sufficient. Yet current Retrieval-Augmented Generation (RAG) systems organize external knowledge as flat chunks retrieved by embedding similarity, exposing a retrieval-as-lookup interface ill-suited to iterative reasoning agents. We propose LLM-Wiki, an agent-native retrieval system that operationalizes the Retrieval-as-Reasoning paradigm by treating external knowledge as a compilable, composable, and self-evolving structure rather than a static retrieval index. LLM-Wiki compiles documents into structured Wiki pages with bidirectional links, exposes search, read, and link-following operations through standard tool-calling interfaces, and introduces an Error Book for persistent structural and semantic self-correction. LLM-Wiki achieves state-of-the-art results on HotpotQA, MuSiQue, and 2WikiMultiHopQA, outperforming HippoRAG 2, LightRAG, and GraphRAG by 2.0-8.1 F1 points. On AuthTrace, LLM-Wiki achieves the best overall accuracy, with especially strong gains on multi-document structured queries, confirming that compilation-based retrieval generalizes beyond chain-style multi-hop reasoning.

2605.25382 2026-05-27 cs.CL

AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

AuthTrace: 主题密集的单作者语料库中的证据构建诊断

Xiaoqing Wu, Feifei Li, Haoliang Ming, Wenhui Que

发表机构 * Tencent Inc.(腾讯公司) Beijing, China(中国北京)

AI总结 提出AuthTrace基准,通过扇入梯度诊断主题密集单作者语料库中证据构建系统的召回率、精度和答案正确性,发现证据召回是答案正确性的最强预测因子。

详情
AI中文摘要

证据构建——决定在生成开始前哪些段落到达语言模型的阶段——按范式进行评估,使得从业者无法有原则地诊断哪种组织策略失败、在哪里失败或为什么失败。我们引入了AuthTrace,这是一个基于主题密集的单作者语料库构建的诊断基准,其中近失干扰项与所需证据共享风格、主题和词汇。AuthTrace提供明确的引用证据、精确的扇入注释以及统一的包级协议,用于衡量证据召回率、证据精度和答案正确性。扇入梯度——支持答案所需的源文档数量——作为主要诊断轴,使得能够在检索、记忆、图和结构化证据范式之间进行受控比较。评估两个QA模型上的八个系统,我们发现,在主要读者-判断器对下,证据召回率是答案正确性最强的观察预测因子(r = 0.96);大多数失败源于缺失证据而非答案合成。扇入进一步揭示了特定范式的崩溃模式:平面检索的退化速度比主题组织的证据构建快2-3倍。这些结果表明,扇入分解是一种可重用的诊断镜头,用于识别证据构建系统失败的位置以及哪种范式最适合给定的工作负载。

英文摘要

Evidence construction--the stage that determines which passages reach the language model before generation begins--is evaluated paradigm by paradigm, leaving practitioners with no principled way to diagnose which organization strategy fails, where, or why. We introduce AuthTrace, a diagnostic benchmark built on thematically dense single-author corpora where near-miss distractors share style, topic, and vocabulary with the required evidence. AuthTrace provides explicit quoted evidence, exact fan-in annotation, and a unified pack-level protocol measuring evidence recall, evidence precision, and answer correctness. A fan-in gradient--the number of source documents required to support the answer--serves as the primary diagnostic axis, enabling controlled comparison across retrieval, memory, graph, and structured-evidence paradigms. Evaluating eight systems across two QA models, we find that evidence recall is the strongest observed predictor of answer correctness under the primary reader-judge pair (r = 0.96); most failures stem from missing evidence rather than answer synthesis. Fan-in further exposes paradigm-specific collapse patterns: flat retrieval degrades 2-3x faster than thematically organized evidence construction. These results show fan-in decomposition to be a reusable diagnostic lens for identifying where evidence-construction systems fail and which paradigm best serves a given workload.

2605.25281 2026-05-27 cs.CL cs.AI

READER: Reasoning-Enhanced AI-Generated Text Detection

READER: 增强推理的AI生成文本检测

Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu, Giulia Livieri, Chengchun Shi

发表机构 * School of Mathematics, University of Birmingham(伯明翰大学数学学院)

AI总结 提出READER方法,通过微调1.5B参数的LLM在结构化推理数据集READ上,结合推理与检测,在分布偏移下优于GPT-5.2等大100-1000倍的模型。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进步使得区分人类撰写的文本与AI生成的内容变得越来越困难。许多现有的检测器训练有监督的神经分类器,这些分类器在分布内表现强劲,但通常不透明,且在分布偏移下性能可能大幅下降。我们提出READER,一种增强推理的AI文本检测器,它输出人类/AI标签以及描述其决策证据的结构化理由。我们方法的一个关键组成部分是READ,一个包含理由和判决的精心策划的监督集。我们在READ上微调一个LLM以构建READER,该检测器在推理时先推理再检测。尽管只有1.5B参数,READER始终优于现有检测器以及提示式的高容量LLM基线(GPT-5.2、Gemini-3-Pro和DeepSeek-V3.2),这些基线的规模大100到1000倍。

英文摘要

Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated content. Many existing detectors train supervised neural classifiers that achieve strong in-distribution performance but are often opaque and can degrade substantially under distribution shift. We present READER, a reasoning-enhanced AI text detector that outputs both a human/AI label and a structured rationale describing the evidence for its decision. A key component of our approach is READ, a curated supervision set of rationales and verdicts. We fine-tune an LLM on READ to build READER, which reasons before detecting at inference time. Despite having only 1.5B parameters, READER consistently outperforms existing detectors as well as prompted, high-capacity LLM baselines (GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2), which are 100 to 1000 times larger in scale.

2605.24636 2026-05-27 cs.AI cs.CL

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

GlobalDentBench:一个用于评估牙科领域大语言模型临床推理能力并包含专家校准的多国基准

Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang

发表机构 * Division of Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong(香港大学牙科学院应用口腔科学与社区牙科护理系) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)人工智能学院) Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(坪山)牙周科) Beijing Institute of Collaborative Innovation(北京协同创新研究院) Department of Orthodontics, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(坪山)正畸科) Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(坪山)) College of Future Technology, Peking University(北京大学未来技术学院) Freedom AI New Cornerstone Science Laboratory, National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University, Beijing 100871, China(新基石科学实验室、国家生物医学成像中心、膜生物学国家重点实验室、分子医学研究院、北京大学未来技术学院、生命科学中心,北京大学,北京100871,中国) IDG/McGovern Institute for Brain Research, Peking University, Beijing 100871, China(IDG/ McGovern脑科学研究院,北京大学,北京100871,中国) Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, The University of Hong Kong(香港大学牙科学院口腔颌面外科系) Shenzhen Loop Area Institute(深圳环城区域研究所) Department of Cancer Biology, Mayo Clinic Arizona, 5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA(梅奥诊所亚利桑那分部癌症生物学部门,5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA) Department of Conservative Dentistry, Periodontology and Digital Dentistry, LMU University Hospital, LMU Munich, Munich, Germany(慕尼黑大学医院保守牙科、牙周病学和数字牙科部门,慕尼黑,德国,慕尼黑大学) Division of Periodontology & Implant Dentistry, Faculty of Dentistry, The University of Hong Kong, Hong Kong, SAR, China(香港大学牙科学院牙周病学与种植牙科系,香港,中国)

AI总结 提出首个跨国牙科基准GlobalDentBench,包含14个专科、88个国家的8978道专家验证题目,评估三种推理层次,揭示当前大语言模型在牙科临床推理中性能随复杂度下降且存在高风险。

详情
AI中文摘要

尽管大语言模型(LLMs)在医学领域具有变革潜力,但其在真实临床场景中的推理鲁棒性和安全性仍未得到充分探索,尤其是在牙科领域。本文提出GlobalDentBench,首个跨国牙科基准,其分类体系涵盖六大洲88个国家和地区的14个牙科专科。该基准包含8978道专家验证题目,分为三种格式(选择题、简答题和基于案例的题目),并评估三个递进推理层次:知识回忆(L1)、常规推理(L2)和个体化推理(L3)。为确保数据质量,自动构建框架由六名资深牙医校准,选择题和简答题的专家一致率达到99.98%,更复杂的基于案例的题目达到96.78%。在GlobalDentBench上对12个前沿LLMs的评估显示,随着推理复杂度增加,性能呈急剧阶梯式下降。具体而言,准确率从选择题的81.34%骤降至简答题的64.53%和基于案例的题目的22.34%,同时从L1的74.01%显著下降至L2的55.64%和L3的35.71%。更关键的是,对真实牙科案例的风险分析表明,LLM生成的临床建议中总体不安全率高达31.01%,其中4.51%存在导致不可逆患者伤害的风险,且风险在正畸等专科中尤为突出。这些发现暴露了当前LLMs在医学推理和安全性方面的根本局限性。因此,GlobalDentBench为可信赖的临床AI评估提供了可扩展的基础,强调了在医疗领域安全部署这些模型之前迫切需要严格验证。

英文摘要

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

2605.24465 2026-05-27 cs.RO

Polymander II: an amphibious salamander-inspired robot with contact and flow sensors

Polymander II:一种带有接触和流量传感器的两栖蝾螈启发机器人

Qiyuan Fu, Sudong Lee, Andrea Grillo, Jonathan Arreguit, Louis Gevers, Josie Hughes, Auke J. Ijspeert

发表机构 * Biorobotics Laboratory, EPFL(生物机器人实验室,瑞士联邦理工学院) CREATE Lab, EPFL(CREATE实验室,瑞士联邦理工学院) Innobridge Services Sàrl(Innobridge Services公司)

AI总结 本文提出一种基于霍尔效应传感器的两栖机器人,用于感知足部接触力和侧向水动力,实现陆水环境感知与反馈控制。

Comments This work has been accepted for publication in the 2026 International Conference on Robotics and Automation (ICRA), Vienna, Austria

详情
AI中文摘要

机器人受益于感官信息来协调身体运动、增强对扰动的鲁棒性,并在不同模式间转换以适应各种地形。然而,很少有两栖机器人能够感知与陆地和水中环境的交互。在本文中,我们提出了一种解决方案,使用霍尔效应传感器来感知一种受蝾螈启发的两栖机器人的足部接触力和侧向水动力。通过两条总线,机器人可以同时以超过500 Hz的频率获取这些外部感受信息,并以100 Hz的频率获取本体感受信息,如关节位置和负载。所使用的霍尔效应传感器体积小巧,适合嵌入机器人多个位置,并且对小力具有高灵敏度。此外,由于传感器可以与测量对象分开放置,防水实现相对容易。我们的测试展示了机器人在穿越两栖环境方面的能力,以及其在利用反馈控制执行更复杂运动任务方面的潜力。

英文摘要

Robots benefit from sensory information to coordinate body movement, gain robustness against perturbations, and transition between different modes to adapt to various terrains. However, few amphibious robots can sense interactions with both terrestrial and aquatic environments. In this paper, we present a solution that uses Hall-effect sensors to sense foot contact forces and lateral hydrodynamic forces on a salamander-inspired amphibious robot. With two bus lines, the robot can simultaneously acquire this exteroceptive information at more than 500 Hz and proprioceptive information, such as joint positions and loads, at 100 Hz. The Hall-effect sensors used are compact, making them suitable for embedding in multiple positions within a robot, and exhibit high sensitivity to small forces. Moreover, because the sensor can be positioned separately from the measured object, waterproofing can be implemented with relative ease. Our tests demonstrate the robot's capabilities in traversing amphibious environments and its potential in using feedback control for more complex locomotion tasks.

2605.24456 2026-05-27 cs.CV

EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

Jinzhao Li, Yinuo Chen, Dongxu Piao, Panwang Pan, Yifan Yu, Dong Wang, Honglei Yan, Liang Yue, Shaofei Wang, Yixin Chen, Siyuan Huang, Miao Liu

Affiliations College of AI, Tsinghua University; ByteDance; State Key Laboratory of General Artificial Intelligence, BIGAI

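AI summary Proposes the EgoProx benchmark, which uses cognitive-chain tasks and an agent-based data engine to evaluate multimodal large language models on egocentric 3D proximity reasoning, and finds that the models possess spatial knowledge but struggle to use it effectively.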

Comments Accepted to CVPR 2026

Details
Abstract

Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.

2605.24219 2026-05-27 cs.AI

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin, Dhaval Patel

Affiliations IBM; Columbia University

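AI summary Proposes the Trajel dataset and evaluation framework, which audits trajectory-level hallucinations in multi-agent industrial workflows using a five-type hallucination taxonomy, identifies common failure modes that existing benchmarks miss, and shows that trajectory-aware detection outperforms standard post-hoc verification.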

Details
Abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.
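
To make the contrast between final-output and trajectory-level evaluation concrete, here is a minimal sketch. It is not the Trajel data format or code: the `Step` record, the function names, and the example trace are hypothetical. Only the five taxonomy labels and the Thought-Action-Observation step structure come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class HallucinationType(Enum):
    """The five types named in the abstract."""
    FACTUAL = "factual"
    REFERENTIAL = "referential"
    LOGICAL = "logical"
    PROCEDURAL = "procedural"
    SCOPE_BASED = "scope-based"


@dataclass
class Step:
    """One Thought-Action-Observation step (hypothetical record layout)."""
    thought: str
    action: str
    observation: str
    label: Optional[HallucinationType] = None  # None: no hallucination annotated


def final_output_flag(trajectory: List[Step]) -> bool:
    """Final-output-only check: inspects the last step and nothing else."""
    return bool(trajectory) and trajectory[-1].label is not None


def trajectory_flag(trajectory: List[Step]) -> bool:
    """Trajectory-level check: any annotated step marks the whole trace."""
    return any(step.label is not None for step in trajectory)


def types_present(trajectory: List[Step]) -> Set[HallucinationType]:
    """Distinct hallucination types annotated anywhere in the trace."""
    return {step.label for step in trajectory if step.label is not None}


# Invented example trace: two intermediate steps are annotated, the last is clean.
trace = [
    Step("find the pump", "lookup(asset='P-7')", "no such asset",
         HallucinationType.REFERENTIAL),
    Step("skip validation", "schedule_work_order()", "ok",
         HallucinationType.PROCEDURAL),
    Step("report result", "respond()", "work order scheduled"),
]

missed_by_final_check = trajectory_flag(trace) and not final_output_flag(trace)
multi_type = len(types_present(trace)) > 1
```

In this invented trace the final-output check passes while the trajectory-level check flags two distinct types, which is the kind of case the abstract says final-answer benchmarks miss.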